Hi,
I am using cloudera distribution, and when I do a" desc formatted table” I don
t get all the table parameters.
But I did a hive orcfiledump on one random file ( I replaced some of the values
that can be sensible) :
hive --orcfiledump
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
2016-04-17 01:36:15,424 WARN [main] mapreduce.TableMapReduceUtil: The
hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
Continuing without it.
Structure for
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
File Version: 0.12 with HIVE_8732
16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001 with
{include: null, offset: 0, length: 9223372036854775807}
Rows: 104260
Compression: ZLIB
Compression size: 262144
Type: struct<my table structure,the columns in the query have the idx 10 and 13
and >
Stripe Statistics:
Stripe 1:
Column 0: count: 104260 hasNull: false
Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
Column 8: count: 104260 hasNull: false min: max: XXX sum: 1657073
Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
6958348.122699987
Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418
sum: 1546639.6964531767
Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681
sum: 45711.881143417035
Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03
sum: 1042600
File Statistics:
Column 0: count: 104260 hasNull: false
Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
Column 8: count: 104260 hasNull: false min: max: Travel sum: 1657073
Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
6958348.122699987
Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418
sum: 1546639.6964531767
Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum:
45711.881143417035
Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum:
1042600
Stripes:
Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
Stream: column 0 section ROW_INDEX start: 3 length 21
Stream: column 1 section ROW_INDEX start: 24 length 148
Stream: column 2 section ROW_INDEX start: 172 length 160
Stream: column 3 section ROW_INDEX start: 332 length 168
Stream: column 4 section ROW_INDEX start: 500 length 133
Stream: column 5 section ROW_INDEX start: 633 length 152
Stream: column 6 section ROW_INDEX start: 785 length 141
Stream: column 7 section ROW_INDEX start: 926 length 165
Stream: column 8 section ROW_INDEX start: 1091 length 150
Stream: column 9 section ROW_INDEX start: 1241 length 160
Stream: column 10 section ROW_INDEX start: 1401 length 205
Stream: column 11 section ROW_INDEX start: 1606 length 200
Stream: column 12 section ROW_INDEX start: 1806 length 201
Stream: column 13 section ROW_INDEX start: 2007 length 292
Stream: column 14 section ROW_INDEX start: 2299 length 375
Stream: column 15 section ROW_INDEX start: 2674 length 191
Stream: column 16 section ROW_INDEX start: 2865 length 295
Stream: column 17 section ROW_INDEX start: 3160 length 194
Stream: column 18 section ROW_INDEX start: 3354 length 192
Stream: column 19 section ROW_INDEX start: 3546 length 122
Stream: column 1 section DATA start: 3668 length 33586
Stream: column 1 section LENGTH start: 37254 length 7
Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
Stream: column 2 section DATA start: 37275 length 40616
Stream: column 2 section LENGTH start: 77891 length 8
Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
Stream: column 3 section DATA start: 77931 length 46120
Stream: column 3 section LENGTH start: 124051 length 17
Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
Stream: column 4 section DATA start: 124167 length 26498
Stream: column 5 section DATA start: 150665 length 38409
Stream: column 5 section LENGTH start: 189074 length 8
Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
Stream: column 6 section DATA start: 189112 length 9425
Stream: column 6 section LENGTH start: 198537 length 9
Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
Stream: column 7 section DATA start: 198582 length 95465
Stream: column 7 section LENGTH start: 294047 length 127
Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
Stream: column 8 section DATA start: 295304 length 43896
Stream: column 8 section LENGTH start: 339200 length 16
Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
Stream: column 9 section DATA start: 339323 length 42274
Stream: column 9 section LENGTH start: 381597 length 9
Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
Stream: column 10 section DATA start: 381661 length 89206
Stream: column 11 section DATA start: 470867 length 46613
Stream: column 12 section DATA start: 517480 length 62630
Stream: column 13 section DATA start: 580110 length 103241
Stream: column 14 section DATA start: 683351 length 138479
Stream: column 15 section DATA start: 821830 length 34983
Stream: column 16 section DATA start: 856813 length 7917
Stream: column 17 section DATA start: 864730 length 47435
Stream: column 18 section DATA start: 912165 length 462
Stream: column 19 section DATA start: 912627 length 140
Stream: column 19 section LENGTH start: 912767 length 6
Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[3]
Encoding column 2: DICTIONARY_V2[5]
Encoding column 3: DICTIONARY_V2[12]
Encoding column 4: DIRECT_V2
Encoding column 5: DICTIONARY_V2[6]
Encoding column 6: DICTIONARY_V2[8]
Encoding column 7: DICTIONARY_V2[175]
Encoding column 8: DICTIONARY_V2[11]
Encoding column 9: DICTIONARY_V2[7]
Encoding column 10: DIRECT_V2
Encoding column 11: DIRECT_V2
Encoding column 12: DIRECT
Encoding column 13: DIRECT
Encoding column 14: DIRECT
Encoding column 15: DIRECT_V2
Encoding column 16: DIRECT
Encoding column 17: DIRECT_V2
Encoding column 18: DIRECT_V2
Encoding column 19: DICTIONARY_V2[1]
File length: 914171 bytes
Padding length: 0 bytes
Padding ratio: 0%
I also noticed something :
If I load a table like that :
sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
The queries done on myTable take at least twice the amount of time compared to
queries done on the table loaded with hive directly.
For technical reasons my pipeline is not fully migrated to use hive tables, and
in a lot of place I still manually load the table and register it.
I only see that problem with ORC format.
Do you see a solution so I could have the same performance on loaded tables?
thanks
From: Mich Talebzadeh
<[email protected]<mailto:[email protected]>>
Date: Saturday, April 16, 2016 at 4:14 AM
To: maurin lenglart <[email protected]<mailto:[email protected]>>,
"user @spark" <[email protected]<mailto:[email protected]>>
Subject: Re: orc vs parquet aggregation, orc is really slow
Apologies that should read
desc formatted <table_name>
Example for table dummy
hive> desc formatted dummy;
OK
# col_name data_type comment
id int
clustered int
scattered int
randomised int
random_string varchar(50)
small_vc varchar(10)
padding varchar(10)
# Detailed Table Information
Database: test
Owner: hduser
CreateTime: Sun Jan 31 06:09:56 GMT 2016
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
numFiles 35
numRows 100000000
orc.bloom.filter.columns ID
orc.bloom.filter.fpp 0.05
orc.compress SNAPPY
orc.create.index true
orc.row.index.stride 10000
orc.stripe.size 16777216
rawDataSize 0
totalSize 5660813776
transient_lastDdlTime 1458636871
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 256
Bucket Columns: [id]
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.065 seconds, Fetched: 42 row(s)
Store dates as dates.
To convert a string to date for a column like paymentdate (note in below UK
date formatted converted to Hive yyyy-MM-dd)use the following:
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
AS paymentdate
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>
On 16 April 2016 at 09:09, Maurin Lenglart
<[email protected]<mailto:[email protected]>> wrote:
Hi,
I have : 17970737 rows
I tried to do a “desc formatted statistics myTable” but I get “Error while
compiling statement: FAILED: SemanticException [Error 10001]: Table not found
statistics”
Even after doing something like : “ANALYZE TABLE myTable COMPUTE STATISTICS FOR
COLUMNS"
Thank you for your answer.
From: Mich Talebzadeh
<[email protected]<mailto:[email protected]>>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <[email protected]<mailto:[email protected]>>
Cc: "user @spark" <[email protected]<mailto:[email protected]>>
Subject: Re: orc vs parquet aggregation, orc is really slow
Have you analysed statistics on the ORC table? How many rows are there?
Also send the outp of
desc formatted statistics <TABLE_NAME>
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>
On 16 April 2016 at 08:20, Maurin Lenglart
<[email protected]<mailto:[email protected]>> wrote:
Hi,
I am executing one query :
“SELECT `event_date` as `event_date`,sum(`bookings`) as
`bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
'2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000”
My table was created something like :
CREATE TABLE myTable (
bookings DOUBLE
, deal views INT
)
STORED AS ORC or PARQUET
PARTITION BY (event_date STRING)
PARQUET take 9second of cumulative CPU
ORC take 50second of cumulative CPU.
For ORC I have tried to
hiveContext.setConf(“Spark.Sql.Orc.FilterPushdown”,“true”)
But it didn’t change anything
I am missing something, or parquet is better for this type of query?
I am using spark 1.6.0 with hive 1.1.0
thanks