Hang on, so it takes 15 seconds to switch the database context with HiveContext.sql("use myDatabase")?
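For readers skimming the thread, a minimal PySpark sketch of the three registration paths being compared in the benchmarks quoted below (Spark 1.6 APIs; the file paths and database/table names are the placeholders used in the thread, and the timings in the comments are the figures Maurin reports):

    # Sketch only (Spark 1.6 style); paths and names are placeholders from the thread.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-vs-parquet-bench")
    # ORC as a data source needs Hive support in Spark 1.x, so use HiveContext throughout.
    hiveContext = HiveContext(sc)

    query = """SELECT `event_date`, sum(`bookings`) AS `bookings`,
                      sum(`dealviews`) AS `dealviews`
               FROM myTable
               WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
               GROUP BY `event_date` LIMIT 20000"""

    # 1) Parquet files registered as a temporary table (~8 s reported below).
    hiveContext.read.format("parquet").load("mytableFiles").registerTempTable("myTable")
    hiveContext.sql(query).collect()

    # 2) ORC files registered the same way (~50 s reported below).
    hiveContext.dropTempTable("myTable")
    hiveContext.read.format("orc").load("mytableFiles").registerTempTable("myTable")
    hiveContext.sql(query).collect()

    # 3) The Hive-managed ORC table read through the metastore (~15 s reported below).
    hiveContext.dropTempTable("myTable")  # so the temp table no longer shadows the Hive table
    hiveContext.sql("use myDatabase")
    hiveContext.sql(query).collect()

(The thread uses the older registerAsTable name; registerTempTable is the Spark 1.6 equivalent.)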
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

> The stats are only for one file in one partition. There are 17970737 rows
> in total.
> The table is not bucketed.
>
> The problem is not inserting rows; the problem is with this SQL query:
>
> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
> LIMIT 20000"
>
> Benchmarks:
>
>   - 8 seconds on the Parquet table loaded using
>     sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
>   - 50 seconds on ORC using
>     sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>   - 15 seconds on ORC using sqlContext('use myDatabase')
>
> The use case that I have is the second and slowest benchmark. Is there
> something I can do to speed that up?
>
> thanks
>
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Sunday, April 17, 2016 at 2:22 PM
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Hi Maurin,
>
> Have you tried creating your table in Hive as a Parquet table? This table
> is pretty small, with 100K rows.
>
> Is the Hive table bucketed at all? I gather your issue at the moment is
> that inserting rows into the Hive table is taking longer (compared to
> Parquet)?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I am using the Cloudera distribution, and when I do a "desc formatted table"
>> I don't get all the table parameters.
>>
>> But I did a hive orcfiledump on one random file (I replaced some of the
>> values that may be sensitive):
>>
>> hive --orcfiledump /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>> 2016-04-17 01:36:15,424 WARN [main] mapreduce.TableMapReduceUtil: The
>> hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
>> Continuing without it.
>> Structure for /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>> File Version: 0.12 with HIVE_8732
>> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>> with {include: null, offset: 0, length: 9223372036854775807}
>> Rows: 104260
>> Compression: ZLIB
>> Compression size: 262144
>> Type: struct<my table structure, the columns in the query have the idx 10 and 13 and >
>>
>> Stripe Statistics:
>>   Stripe 1:
>>     Column 0: count: 104260 hasNull: false
>>     Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>>     Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>>     Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
>>     Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>>     Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>>     Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>>     Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>>     Column 8: count: 104260 hasNull: false min: max: XXX sum: 1657073
>>     Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>>     Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>>     Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>>     Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>>     Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>>     Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>>     Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>>     Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>>     Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>>     Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>>     Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>>
>> File Statistics:
>>   Column 0: count: 104260 hasNull: false
>>   Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>>   Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>>   Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
>>   Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>>   Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>>   Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>>   Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>>   Column 8: count: 104260 hasNull: false min: max: Travel sum: 1657073
>>   Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>>   Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>>   Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>>   Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>>   Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>>   Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>>   Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>>   Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>>   Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>>   Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>>   Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>>
>> Stripes:
>>   Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
>>     Stream: column 0 section ROW_INDEX start: 3 length 21
>>     Stream: column 1 section ROW_INDEX start: 24 length 148
>>     Stream: column 2 section ROW_INDEX start: 172 length 160
>>     Stream: column 3 section ROW_INDEX start: 332 length 168
>>     Stream: column 4 section ROW_INDEX start: 500 length 133
>>     Stream: column 5 section ROW_INDEX start: 633 length 152
>>     Stream: column 6 section ROW_INDEX start: 785 length 141
>>     Stream: column 7 section ROW_INDEX start: 926 length 165
>>     Stream: column 8 section ROW_INDEX start: 1091 length 150
>>     Stream: column 9 section ROW_INDEX start: 1241 length 160
>>     Stream: column 10 section ROW_INDEX start: 1401 length 205
>>     Stream: column 11 section ROW_INDEX start: 1606 length 200
>>     Stream: column 12 section ROW_INDEX start: 1806 length 201
>>     Stream: column 13 section ROW_INDEX start: 2007 length 292
>>     Stream: column 14 section ROW_INDEX start: 2299 length 375
>>     Stream: column 15 section ROW_INDEX start: 2674 length 191
>>     Stream: column 16 section ROW_INDEX start: 2865 length 295
>>     Stream: column 17 section ROW_INDEX start: 3160 length 194
>>     Stream: column 18 section ROW_INDEX start: 3354 length 192
>>     Stream: column 19 section ROW_INDEX start: 3546 length 122
>>     Stream: column 1 section DATA start: 3668 length 33586
>>     Stream: column 1 section LENGTH start: 37254 length 7
>>     Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
>>     Stream: column 2 section DATA start: 37275 length 40616
>>     Stream: column 2 section LENGTH start: 77891 length 8
>>     Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
>>     Stream: column 3 section DATA start: 77931 length 46120
>>     Stream: column 3 section LENGTH start: 124051 length 17
>>     Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
>>     Stream: column 4 section DATA start: 124167 length 26498
>>     Stream: column 5 section DATA start: 150665 length 38409
>>     Stream: column 5 section LENGTH start: 189074 length 8
>>     Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
>>     Stream: column 6 section DATA start: 189112 length 9425
>>     Stream: column 6 section LENGTH start: 198537 length 9
>>     Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
>>     Stream: column 7 section DATA start: 198582 length 95465
>>     Stream: column 7 section LENGTH start: 294047 length 127
>>     Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
>>     Stream: column 8 section DATA start: 295304 length 43896
>>     Stream: column 8 section LENGTH start: 339200 length 16
>>     Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
>>     Stream: column 9 section DATA start: 339323 length 42274
>>     Stream: column 9 section LENGTH start: 381597 length 9
>>     Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
>>     Stream: column 10 section DATA start: 381661 length 89206
>>     Stream: column 11 section DATA start: 470867 length 46613
>>     Stream: column 12 section DATA start: 517480 length 62630
>>     Stream: column 13 section DATA start: 580110 length 103241
>>     Stream: column 14 section DATA start: 683351 length 138479
>>     Stream: column 15 section DATA start: 821830 length 34983
>>     Stream: column 16 section DATA start: 856813 length 7917
>>     Stream: column 17 section DATA start: 864730 length 47435
>>     Stream: column 18 section DATA start: 912165 length 462
>>     Stream: column 19 section DATA start: 912627 length 140
>>     Stream: column 19 section LENGTH start: 912767 length 6
>>     Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
>>     Encoding column 0: DIRECT
>>     Encoding column 1: DICTIONARY_V2[3]
>>     Encoding column 2: DICTIONARY_V2[5]
>>     Encoding column 3: DICTIONARY_V2[12]
>>     Encoding column 4: DIRECT_V2
>>     Encoding column 5: DICTIONARY_V2[6]
>>     Encoding column 6: DICTIONARY_V2[8]
>>     Encoding column 7: DICTIONARY_V2[175]
>>     Encoding column 8: DICTIONARY_V2[11]
>>     Encoding column 9: DICTIONARY_V2[7]
>>     Encoding column 10: DIRECT_V2
>>     Encoding column 11: DIRECT_V2
>>     Encoding column 12: DIRECT
>>     Encoding column 13: DIRECT
>>     Encoding column 14: DIRECT
>>     Encoding column 15: DIRECT_V2
>>     Encoding column 16: DIRECT
>>     Encoding column 17: DIRECT_V2
>>     Encoding column 18: DIRECT_V2
>>     Encoding column 19: DICTIONARY_V2[1]
>>
>> File length: 914171 bytes
>> Padding length: 0 bytes
>> Padding ratio: 0%
>>
>> I also noticed something:
>> If I load a table like this:
>>
>> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>>
>> the queries done on myTable take at least twice the amount of time
>> compared to queries done on the table read through Hive directly.
>> For technical reasons my pipeline is not fully migrated to use Hive
>> tables, and in a lot of places I still manually load the table and
>> register it.
>> I only see that problem with the ORC format.
>> Do you see a solution so I could have the same performance on loaded
>> tables?
>>
>> thanks
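(Not something confirmed in this thread, just a sketch of one possible workaround for the loaded-table slowdown described above: instead of re-loading the ORC files, take the DataFrame straight from the Hive metastore table and register that, so the pipeline stays programmatic but queries still go through the Hive table definition.)

    # Assumes an existing SparkContext `sc`; database/table names are the ones used above.
    from pyspark.sql import HiveContext

    hiveContext = HiveContext(sc)
    df = hiveContext.table("myDatabase.myTable")  # metastore-backed DataFrame, no manual file load
    df.registerTempTable("myTable")               # downstream code can keep querying "myTable"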
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Saturday, April 16, 2016 at 4:14 AM
>> To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org>
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Apologies, that should read
>>
>> desc formatted <table_name>
>>
>> Example for table dummy:
>>
>> hive> desc formatted dummy;
>> OK
>> # col_name              data_type               comment
>> id                      int
>> clustered               int
>> scattered               int
>> randomised              int
>> random_string           varchar(50)
>> small_vc                varchar(10)
>> padding                 varchar(10)
>>
>> # Detailed Table Information
>> Database:               test
>> Owner:                  hduser
>> CreateTime:             Sun Jan 31 06:09:56 GMT 2016
>> LastAccessTime:         UNKNOWN
>> Retention:              0
>> Location:               hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
>> Table Type:             MANAGED_TABLE
>> Table Parameters:
>>         COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
>>         numFiles                35
>>         numRows                 100000000
>>         orc.bloom.filter.columns        ID
>>         orc.bloom.filter.fpp    0.05
>>         orc.compress            SNAPPY
>>         orc.create.index        true
>>         orc.row.index.stride    10000
>>         orc.stripe.size         16777216
>>         rawDataSize             0
>>         totalSize               5660813776
>>         transient_lastDdlTime   1458636871
>>
>> # Storage Information
>> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
>> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>> Compressed:             No
>> Num Buckets:            256
>> Bucket Columns:         [id]
>> Sort Columns:           []
>> Storage Desc Params:
>>         serialization.format    1
>> Time taken: 0.065 seconds, Fetched: 42 row(s)
>>
>> Store dates as dates.
>>
>> To convert a string to a date for a column like paymentdate (note: below, a
>> UK-formatted date is converted to Hive's yyyy-MM-dd), use the following:
>>
>> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>>
>>> Hi,
>>> I have: 17970737 rows.
>>> I tried to do a "desc formatted statistics myTable" but I get "Error
>>> while compiling statement: FAILED: SemanticException [Error 10001]: Table
>>> not found statistics",
>>> even after doing something like: "ANALYZE TABLE myTable COMPUTE
>>> STATISTICS FOR COLUMNS".
>>>
>>> Thank you for your answer.
>>>
>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>> Date: Saturday, April 16, 2016 at 12:32 AM
>>> To: maurin lenglart <mau...@cuberonlabs.com>
>>> Cc: "user @spark" <user@spark.apache.org>
>>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>>
>>> Have you analysed statistics on the ORC table? How many rows are there?
>>>
>>> Also send the output of
>>>
>>> desc formatted statistics <TABLE_NAME>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>>>
>>>> Hi,
>>>> I am executing one query:
>>>>
>>>> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
>>>> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
>>>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>>>> LIMIT 20000"
>>>>
>>>> My table was created with something like:
>>>>
>>>> CREATE TABLE myTable (
>>>>     bookings DOUBLE,
>>>>     dealviews INT
>>>> )
>>>> PARTITIONED BY (event_date STRING)
>>>> STORED AS ORC   -- or PARQUET
>>>>
>>>> Parquet takes 9 seconds of cumulative CPU;
>>>> ORC takes 50 seconds of cumulative CPU.
>>>>
>>>> For ORC I have tried
>>>> hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
>>>> but it didn't change anything.
>>>>
>>>> Am I missing something, or is Parquet just better for this type of query?
>>>>
>>>> I am using Spark 1.6.0 with Hive 1.1.0.
>>>>
>>>> thanks
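To close the loop, a minimal sketch in the same Spark 1.6 Python style that combines the two things suggested in this thread, enabling ORC predicate pushdown and computing table statistics, before running the query above. The lower-case config key and the PARTITION clause are assumptions on my part; on some setups the ANALYZE statements may need to be run from the Hive CLI rather than through Spark.

    # Assumes an existing SparkContext `sc`; table and column names come from the query above.
    from pyspark.sql import HiveContext

    hiveContext = HiveContext(sc)

    # The key is conventionally written spark.sql.orc.filterPushdown; it is off by default in 1.6.
    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

    # Basic and column statistics for the partitioned table
    # (may need to be run from the Hive CLI on some setups).
    hiveContext.sql("ANALYZE TABLE myDatabase.myTable PARTITION (event_date) COMPUTE STATISTICS")
    hiveContext.sql("ANALYZE TABLE myDatabase.myTable PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS")

    result = hiveContext.sql("""
        SELECT `event_date`, sum(`bookings`) AS `bookings`, sum(`dealviews`) AS `dealviews`
        FROM myDatabase.myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000
    """)
    result.show()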