Hi Maurin,

Have you tried creating your table in Hive as a Parquet table? The file you dumped is pretty small, at roughly 100K rows.

Is the Hive table bucketed at all? I gather your issue at the moment is that inserting rows into the Hive table is taking longer (compared to Parquet)?
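A minimal sketch of what I mean, from a spark-shell with a HiveContext (myTable and myTable_parquet are placeholder names, not from your setup; note that CTAS produces an unpartitioned copy, which is fine for a quick comparison):

  // create a Parquet copy of the ORC table to compare against (CTAS, so the
  // copy is unpartitioned; event_date becomes a plain column, fine for a test)
  hiveContext.sql(
    "CREATE TABLE myTable_parquet STORED AS PARQUET AS SELECT * FROM myTable")
  hiveContext.sql("SELECT COUNT(*) FROM myTable_parquet").show()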
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

> Hi,
> I am using the Cloudera distribution, and when I do a "desc formatted <table>"
> I don't get all the table parameters.
>
> But I did a hive orcfiledump on one random file (I replaced some of the
> values that may be sensitive):
>
> hive --orcfiledump /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
> 2016-04-17 01:36:15,424 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree
> module jar containing PrefixTreeCodec is not present. Continuing without it.
> Structure for /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
> File Version: 0.12 with HIVE_8732
> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
> with {include: null, offset: 0, length: 9223372036854775807}
> Rows: 104260
> Compression: ZLIB
> Compression size: 262144
> Type: struct<...my table structure...> (the columns used in the query are at indexes 10 and 13)
>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 104260 hasNull: false
>     Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>     Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>     Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
>     Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>     Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>     Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>     Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>     Column 8: count: 104260 hasNull: false min: max: XXX sum: 1657073
>     Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>     Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>     Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>     Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>     Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>     Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>     Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>     Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>     Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>     Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>     Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>
> File Statistics:
>   Column 0: count: 104260 hasNull: false
>   Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>   Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>   Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
>   Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>   Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>   Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>   Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>   Column 8: count: 104260 hasNull: false min: max: Travel sum: 1657073
>   Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>   Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>   Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>   Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>   Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>   Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>   Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>   Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>   Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>   Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>   Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>
> Stripes:
>   Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
>     Stream: column 0 section ROW_INDEX start: 3 length 21
>     Stream: column 1 section ROW_INDEX start: 24 length 148
>     Stream: column 2 section ROW_INDEX start: 172 length 160
>     Stream: column 3 section ROW_INDEX start: 332 length 168
>     Stream: column 4 section ROW_INDEX start: 500 length 133
>     Stream: column 5 section ROW_INDEX start: 633 length 152
>     Stream: column 6 section ROW_INDEX start: 785 length 141
>     Stream: column 7 section ROW_INDEX start: 926 length 165
>     Stream: column 8 section ROW_INDEX start: 1091 length 150
>     Stream: column 9 section ROW_INDEX start: 1241 length 160
>     Stream: column 10 section ROW_INDEX start: 1401 length 205
>     Stream: column 11 section ROW_INDEX start: 1606 length 200
>     Stream: column 12 section ROW_INDEX start: 1806 length 201
>     Stream: column 13 section ROW_INDEX start: 2007 length 292
>     Stream: column 14 section ROW_INDEX start: 2299 length 375
>     Stream: column 15 section ROW_INDEX start: 2674 length 191
>     Stream: column 16 section ROW_INDEX start: 2865 length 295
>     Stream: column 17 section ROW_INDEX start: 3160 length 194
>     Stream: column 18 section ROW_INDEX start: 3354 length 192
>     Stream: column 19 section ROW_INDEX start: 3546 length 122
>     Stream: column 1 section DATA start: 3668 length 33586
>     Stream: column 1 section LENGTH start: 37254 length 7
>     Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
>     Stream: column 2 section DATA start: 37275 length 40616
>     Stream: column 2 section LENGTH start: 77891 length 8
>     Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
>     Stream: column 3 section DATA start: 77931 length 46120
>     Stream: column 3 section LENGTH start: 124051 length 17
>     Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
>     Stream: column 4 section DATA start: 124167 length 26498
>     Stream: column 5 section DATA start: 150665 length 38409
>     Stream: column 5 section LENGTH start: 189074 length 8
>     Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
>     Stream: column 6 section DATA start: 189112 length 9425
>     Stream: column 6 section LENGTH start: 198537 length 9
>     Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
>     Stream: column 7 section DATA start: 198582 length 95465
>     Stream: column 7 section LENGTH start: 294047 length 127
>     Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
>     Stream: column 8 section DATA start: 295304 length 43896
>     Stream: column 8 section LENGTH start: 339200 length 16
>     Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
>     Stream: column 9 section DATA start: 339323 length 42274
>     Stream: column 9 section LENGTH start: 381597 length 9
>     Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
>     Stream: column 10 section DATA start: 381661 length 89206
>     Stream: column 11 section DATA start: 470867 length 46613
>     Stream: column 12 section DATA start: 517480 length 62630
>     Stream: column 13 section DATA start: 580110 length 103241
>     Stream: column 14 section DATA start: 683351 length 138479
>     Stream: column 15 section DATA start: 821830 length 34983
>     Stream: column 16 section DATA start: 856813 length 7917
>     Stream: column 17 section DATA start: 864730 length 47435
>     Stream: column 18 section DATA start: 912165 length 462
>     Stream: column 19 section DATA start: 912627 length 140
>     Stream: column 19 section LENGTH start: 912767 length 6
>     Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
>     Encoding column 0: DIRECT
>     Encoding column 1: DICTIONARY_V2[3]
>     Encoding column 2: DICTIONARY_V2[5]
>     Encoding column 3: DICTIONARY_V2[12]
>     Encoding column 4: DIRECT_V2
>     Encoding column 5: DICTIONARY_V2[6]
>     Encoding column 6: DICTIONARY_V2[8]
>     Encoding column 7: DICTIONARY_V2[175]
>     Encoding column 8: DICTIONARY_V2[11]
>     Encoding column 9: DICTIONARY_V2[7]
>     Encoding column 10: DIRECT_V2
>     Encoding column 11: DIRECT_V2
>     Encoding column 12: DIRECT
>     Encoding column 13: DIRECT
>     Encoding column 14: DIRECT
>     Encoding column 15: DIRECT_V2
>     Encoding column 16: DIRECT
>     Encoding column 17: DIRECT_V2
>     Encoding column 18: DIRECT_V2
>     Encoding column 19: DICTIONARY_V2[1]
>
> File length: 914171 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
>
> I also noticed something:
> if I load a table like this:
>
> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>
> the queries done on myTable take at least twice the amount of time
> compared to queries done on the table loaded with Hive directly.
> For technical reasons my pipeline is not fully migrated to use Hive
> tables, and in a lot of places I still manually load the table and
> register it.
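For what it's worth, the two code paths are not equivalent: reading the files directly rediscovers the schema from the ORC footers, while going through the metastore reuses the table and partition definitions already in the catalog. A sketch of both, assuming Spark 1.6 and placeholder paths/names (registerTempTable is the 1.6 name for this call):

  // plan A: read the raw ORC files; the schema comes from the file footers and
  // partitions are rediscovered from the event_date=... directory layout
  sqlContext.read.format("orc")
    .load("/user/hive/warehouse/myDB.db/mytable")
    .registerTempTable("myTable_files")

  // plan B: go through the Hive metastore; table and partition metadata
  // (and any gathered statistics) are already known
  hiveContext.table("myDB.mytable").registerTempTable("myTable_hive")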
> I only see that problem with the ORC format.
> Do you see a solution so I could have the same performance on loaded
> tables?
>
> thanks
>
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Saturday, April 16, 2016 at 4:14 AM
> To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Apologies, that should read:
>
> desc formatted <table_name>
>
> Example for table dummy:
>
> hive> desc formatted dummy;
> OK
> # col_name        data_type      comment
> id                int
> clustered         int
> scattered         int
> randomised        int
> random_string     varchar(50)
> small_vc          varchar(10)
> padding           varchar(10)
>
> # Detailed Table Information
> Database:        test
> Owner:           hduser
> CreateTime:      Sun Jan 31 06:09:56 GMT 2016
> LastAccessTime:  UNKNOWN
> Retention:       0
> Location:        hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
> Table Type:      MANAGED_TABLE
> Table Parameters:
>   COLUMN_STATS_ACCURATE     {"BASIC_STATS":"true"}
>   numFiles                  35
>   numRows                   100000000
>   orc.bloom.filter.columns  ID
>   orc.bloom.filter.fpp      0.05
>   orc.compress              SNAPPY
>   orc.create.index          true
>   orc.row.index.stride      10000
>   orc.stripe.size           16777216
>   rawDataSize               0
>   totalSize                 5660813776
>   transient_lastDdlTime     1458636871
>
> # Storage Information
> SerDe Library:   org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:     org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:      No
> Num Buckets:     256
> Bucket Columns:  [id]
> Sort Columns:    []
> Storage Desc Params:
>   serialization.format 1
> Time taken: 0.065 seconds, Fetched: 42 row(s)
>
> Store dates as dates.
>
> To convert a string to a date for a column like paymentdate (note: below, a
> UK-formatted date is converted to Hive's yyyy-MM-dd format), use the following:
>
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I have: 17970737 rows.
>> I tried to do a "desc formatted statistics myTable" but I get "Error
>> while compiling statement: FAILED: SemanticException [Error 10001]: Table
>> not found statistics".
>> That happens even after doing something like "ANALYZE TABLE myTable
>> COMPUTE STATISTICS FOR COLUMNS".
>>
>> Thank you for your answer.
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Saturday, April 16, 2016 at 12:32 AM
>> To: maurin lenglart <mau...@cuberonlabs.com>
>> Cc: "user @spark" <user@spark.apache.org>
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Have you analysed statistics on the ORC table? How many rows are there?
>>
>> Also send the output of
>>
>> desc formatted statistics <TABLE_NAME>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
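As an aside, gathering statistics on a table partitioned by event_date would look roughly like this (a sketch using the names from this thread; whether these statements pass through from a HiveContext or need to be run in the hive CLI depends on the Hive and Spark versions involved):

  // table/partition-level statistics (numRows, rawDataSize, ...) for every partition
  hiveContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS")
  // column-level statistics used by the optimizer
  hiveContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS")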
>> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>>
>>> Hi,
>>> I am executing this query:
>>>
>>> SELECT `event_date` AS `event_date`, sum(`bookings`) AS `bookings`,
>>> sum(`dealviews`) AS `dealviews` FROM myTable WHERE `event_date` >=
>>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>>> LIMIT 20000
>>>
>>> My table was created with something like:
>>>
>>> CREATE TABLE myTable (
>>>   bookings  DOUBLE,
>>>   dealviews INT
>>> )
>>> PARTITIONED BY (event_date STRING)
>>> STORED AS ORC  -- or PARQUET
>>>
>>> Parquet takes 9 seconds of cumulative CPU;
>>> ORC takes 50 seconds of cumulative CPU.
>>>
>>> For ORC I have tried
>>> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
>>> but it didn't change anything.
>>>
>>> Am I missing something, or is Parquet better for this type of query?
>>>
>>> I am using Spark 1.6.0 with Hive 1.1.0.
>>>
>>> thanks
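Putting the thread together, a self-contained sketch of the comparison being discussed, assuming a Spark 1.6 spark-shell with a HiveContext and placeholder table names myTable_orc and myTable_parquet:

  // enable ORC predicate pushdown before running the query
  hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

  val query = """SELECT `event_date`, sum(`bookings`) AS `bookings`,
                 sum(`dealviews`) AS `dealviews` FROM %s
                 WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
                 GROUP BY `event_date` LIMIT 20000"""

  // run the same aggregation against both copies and report wall-clock time
  for (t <- Seq("myTable_orc", "myTable_parquet")) {
    val start = System.nanoTime
    hiveContext.sql(query.format(t)).collect()
    println(s"$t: ${(System.nanoTime - start) / 1e9} s")
  }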