1. In the first case (i.e. the cluster where you have both Hive and Spark),
the query would have been executed via HiveTableScan instead of OrcRelation.
HiveTableScan does not propagate any predicate pushdown (PPD) information to
the ORC readers (SPARK-12998). PPD may not play a big role here anyway, since
your WHERE conditions seem to be only on the event_date partition column.
Check whether most of the time is spent just launching the job. In 1.6.x,
HiveTableScan captures a thread dump (for the call site) for every reader;
SPARK-12898 should help if that is the case. Also, SPARK-12925 can improve
performance if you are reading lots of String columns.
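
A quick way to confirm which code path the query takes is to print the
physical plan. Below is a minimal PySpark 1.6-style sketch, assuming the same
sqlContext and table/column names used elsewhere in this thread:

df = sqlContext.sql(
    "SELECT event_date, SUM(bookings) AS bookings, SUM(dealviews) AS dealviews "
    "FROM myTable WHERE event_date >= '2016-01-06' AND event_date <= '2016-04-02' "
    "GROUP BY event_date LIMIT 20000")

# The physical plan shows whether the scan is a HiveTableScan or an
# OrcRelation/Parquet scan, and which filters (if any) were pushed down.
df.explain(True)

# ORC predicate pushdown is off by default in 1.6.x and only applies to the
# OrcRelation path, not to HiveTableScan.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")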

2. In the second case (i.e. the cluster where you have Spark alone), can you
check whether the time goes into launching the job or whether it is slow
overall? You can check the job timing in the Spark UI and compare it with the
overall wall-clock runtime (a simple way to do this is sketched after the
list below).
- SPARK-12925, mentioned in the first case, can also hurt performance if you
have lots of String columns.
- OrcRelation internally makes use of HadoopRDD. If lots of partitions are
created in your case, SPARK-14113 can have an impact, since closure cleaning
is invoked every time an RDD is created; this can slow down the overall job
runtime depending on the number of partitions.
- OrcRelation suffers from the callSite problem as well, so it incurs the
thread-dump call for every partition, which can add up depending on the
number of partitions being created.
- Parquet caches a lot of metadata, so the number of round trips to the
NameNode is fairly low; when you run the query a second or third time,
Parquet queries tend to finish faster. OrcRelation has no such metadata
caching and pays the NameNode cost every time (the sketch after this list
also shows this).
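
A rough way to separate driver-side overhead from scan time, and to see the
metadata-caching effect, is to time two consecutive runs of the same
aggregation against each format and compare the wall-clock numbers with the
job durations shown in the Spark UI. The sketch below is illustrative only
(PySpark 1.6 API, existing sqlContext; the paths are placeholders for your
exported files):

import time

AGG = ("SELECT event_date, SUM(bookings) AS bookings, SUM(dealviews) AS dealviews "
       "FROM {table} WHERE event_date >= '2016-01-06' AND event_date <= '2016-04-02' "
       "GROUP BY event_date LIMIT 20000")

def time_two_runs(table):
    # Run the same aggregation twice and return both wall-clock timings.
    timings = []
    for _ in range(2):
        start = time.time()
        sqlContext.sql(AGG.format(table=table)).collect()  # force execution
        timings.append(time.time() - start)
    return timings

# Placeholder paths; point these at your exported ORC / Parquet directories.
sqlContext.read.format("parquet").load("/path/to/parquetFiles").registerTempTable("parqTable")
sqlContext.read.format("orc").load("/path/to/orcFiles").registerTempTable("orcTable")

print("parquet 1st/2nd run: %s" % time_two_runs("parqTable"))
print("orc     1st/2nd run: %s" % time_two_runs("orcTable"))

# If the Spark UI shows the jobs finishing much faster than these wall-clock
# numbers, the overhead is on the driver side (planning, closure cleaning,
# call-site capture, NameNode round trips) rather than in the scan itself.
# Parquet's second run usually benefits from cached footer metadata; the
# 1.6 OrcRelation path has no such cache, so both runs pay the NameNode cost.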

~Rajesh.B

On Mon, Apr 18, 2016 at 3:29 AM, Maurin Lenglart <mau...@cuberonlabs.com>
wrote:

>
> Let me explain a little my architecture:
> I have one cluster with hive and spark. Over there I create my databases
> and create the tables and insert data in them.
> If I execute this query  :self.sqlContext.sql(“SELECT `event_date` as
> `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews`
> FROM myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 20000”) using ORC, it takes 15 seconds.
>
> But then I export the tables to another cluster where I don’t have Hive.
> So I load my tables using
>
> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
> Then this query: self.sqlContext.sql(“SELECT `event_date` as
> `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews`
> FROM myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 20000”) takes 50 seconds.
>
> If I do the same process using parquet tables self.sqlContext.sql(“SELECT
> `event_date` as `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`)
> as `dealviews` FROM myTable WHERE  `event_date` >= '2016-01-06' AND
> `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000”) takes 8
> seconds.
>
>
> thanks
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Sunday, April 17, 2016 at 2:52 PM
>
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> hang on so it takes 15 seconds to switch the database context with 
> HiveContext.sql("use
> myDatabase") ?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> The stats are only for one file in one partition. There is 17970737 rows
>> in total.
>> The table is not bucketed.
>>
>> The problem is not inserting rows, the problem is with this SQL query:
>>
>> “SELECT `event_date` as `event_date`,sum(`bookings`) as
>> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date`
>> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>> LIMIT 20000”
>>
>> Benchmarks :
>>
>>    - 8 seconds on a Parquet table loaded using
>>      sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
>>    - 50 seconds on ORC using
>>      sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>>    - 15 seconds on ORC using sqlContext('use myDatabase')
>>
>> The use case that I have is the second and slowest benchmark. Is there
>> something I can do to speed that up?
>>
>> thanks
>>
>>
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Sunday, April 17, 2016 at 2:22 PM
>>
>> To: maurin lenglart <mau...@cuberonlabs.com>
>> Cc: "user @spark" <user@spark.apache.org>
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Hi Maurin,
>>
>> Have you tried to create your table in Hive as parquet table? This table
>> is pretty small with 100K rows.
>>
>> Is Hive table bucketed at all? I gather your issue is inserting rows into
>> Hive table at the moment that taking longer time (compared to Parquet)?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com>
>> wrote:
>>
>>> Hi,
>>> I am using the Cloudera distribution, and when I do a "desc formatted
>>> table" I don't get all the table parameters.
>>>
>>> But I did a hive orcfiledump on one random file (I replaced some of the
>>> values that could be sensitive):
>>> hive --orcfiledump
>>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>>> 2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The
>>> hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
>>> Continuing without it.
>>> Structure for /user/hive/warehouse/myDB.db/mytable
>>> /event_date=2016-04-01/part-00001
>>> File Version: 0.12 with HIVE_8732
>>> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
>>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>>> with {include: null, offset: 0, length: 9223372036854775807}
>>> Rows: 104260
>>> Compression: ZLIB
>>> Compression size: 262144
>>> Type: struct<my table structure,the columns in the query have the idx
>>> 10 and 13 and  >
>>>
>>> Stripe Statistics:
>>>   Stripe 1:
>>>     Column 0: count: 104260 hasNull: false
>>>     Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>>>     Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>>>     Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
>>>     Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>>>     Column 5: count: 104260 hasNull: false min: XXX max: Others sum:
>>> 262478
>>>     Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
>>>     Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
>>>     Column 8: count: 104260 hasNull: false min:  max: XXX sum: 1657073
>>>     Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum:
>>> 730846
>>>     Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum:
>>> 5481629
>>>     Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>>>     Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum:
>>> 324522.0
>>>     Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
>>> 6958348.122699987
>>>     Column 14: count: 104260 hasNull: false min: -2273.0 max:
>>> 39930.13977860418 sum: 1546639.6964531767
>>>     Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>>>     Column 16: count: 104260 hasNull: false min: 0.0 max:
>>> 4824.029119913681 sum: 45711.881143417035
>>>     Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>>>     Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>>>     Column 19: count: 104260 hasNull: false min: 2016-04-03 max:
>>> 2016-04-03 sum: 1042600
>>>
>>> File Statistics:
>>>   Column 0: count: 104260 hasNull: false
>>>   Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>>>   Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>>>   Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum:
>>> 738629
>>>   Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>>>   Column 5: count: 104260 hasNull: false min: XXX max: Others sum:
>>> 262478
>>>   Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
>>>   Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
>>>   Column 8: count: 104260 hasNull: false min:  max: Travel sum: 1657073
>>>   Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum:
>>> 730846
>>>   Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>>>   Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>>>   Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum:
>>> 324522.0
>>>   Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
>>> 6958348.122699987
>>>   Column 14: count: 104260 hasNull: false min: -2273.0 max:
>>> 39930.13977860418 sum: 1546639.6964531767
>>>   Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>>>   Column 16: count: 104260 hasNull: false min: 0.0 max:
>>> 4824.029119913681 sum: 45711.881143417035
>>>   Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>>>   Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>>>   Column 19: count: 104260 hasNull: false min: 2016-04-03 max:
>>> 2016-04-03 sum: 1042600
>>>
>>> Stripes:
>>>   Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
>>>     Stream: column 0 section ROW_INDEX start: 3 length 21
>>>     Stream: column 1 section ROW_INDEX start: 24 length 148
>>>     Stream: column 2 section ROW_INDEX start: 172 length 160
>>>     Stream: column 3 section ROW_INDEX start: 332 length 168
>>>     Stream: column 4 section ROW_INDEX start: 500 length 133
>>>     Stream: column 5 section ROW_INDEX start: 633 length 152
>>>     Stream: column 6 section ROW_INDEX start: 785 length 141
>>>     Stream: column 7 section ROW_INDEX start: 926 length 165
>>>     Stream: column 8 section ROW_INDEX start: 1091 length 150
>>>     Stream: column 9 section ROW_INDEX start: 1241 length 160
>>>     Stream: column 10 section ROW_INDEX start: 1401 length 205
>>>     Stream: column 11 section ROW_INDEX start: 1606 length 200
>>>     Stream: column 12 section ROW_INDEX start: 1806 length 201
>>>     Stream: column 13 section ROW_INDEX start: 2007 length 292
>>>     Stream: column 14 section ROW_INDEX start: 2299 length 375
>>>     Stream: column 15 section ROW_INDEX start: 2674 length 191
>>>     Stream: column 16 section ROW_INDEX start: 2865 length 295
>>>     Stream: column 17 section ROW_INDEX start: 3160 length 194
>>>     Stream: column 18 section ROW_INDEX start: 3354 length 192
>>>     Stream: column 19 section ROW_INDEX start: 3546 length 122
>>>     Stream: column 1 section DATA start: 3668 length 33586
>>>     Stream: column 1 section LENGTH start: 37254 length 7
>>>     Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
>>>     Stream: column 2 section DATA start: 37275 length 40616
>>>     Stream: column 2 section LENGTH start: 77891 length 8
>>>     Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
>>>     Stream: column 3 section DATA start: 77931 length 46120
>>>     Stream: column 3 section LENGTH start: 124051 length 17
>>>     Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
>>>     Stream: column 4 section DATA start: 124167 length 26498
>>>     Stream: column 5 section DATA start: 150665 length 38409
>>>     Stream: column 5 section LENGTH start: 189074 length 8
>>>     Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
>>>     Stream: column 6 section DATA start: 189112 length 9425
>>>     Stream: column 6 section LENGTH start: 198537 length 9
>>>     Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
>>>     Stream: column 7 section DATA start: 198582 length 95465
>>>     Stream: column 7 section LENGTH start: 294047 length 127
>>>     Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
>>>     Stream: column 8 section DATA start: 295304 length 43896
>>>     Stream: column 8 section LENGTH start: 339200 length 16
>>>     Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
>>>     Stream: column 9 section DATA start: 339323 length 42274
>>>     Stream: column 9 section LENGTH start: 381597 length 9
>>>     Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
>>>     Stream: column 10 section DATA start: 381661 length 89206
>>>     Stream: column 11 section DATA start: 470867 length 46613
>>>     Stream: column 12 section DATA start: 517480 length 62630
>>>     Stream: column 13 section DATA start: 580110 length 103241
>>>     Stream: column 14 section DATA start: 683351 length 138479
>>>     Stream: column 15 section DATA start: 821830 length 34983
>>>     Stream: column 16 section DATA start: 856813 length 7917
>>>     Stream: column 17 section DATA start: 864730 length 47435
>>>     Stream: column 18 section DATA start: 912165 length 462
>>>     Stream: column 19 section DATA start: 912627 length 140
>>>     Stream: column 19 section LENGTH start: 912767 length 6
>>>     Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
>>>     Encoding column 0: DIRECT
>>>     Encoding column 1: DICTIONARY_V2[3]
>>>     Encoding column 2: DICTIONARY_V2[5]
>>>     Encoding column 3: DICTIONARY_V2[12]
>>>     Encoding column 4: DIRECT_V2
>>>     Encoding column 5: DICTIONARY_V2[6]
>>>     Encoding column 6: DICTIONARY_V2[8]
>>>     Encoding column 7: DICTIONARY_V2[175]
>>>     Encoding column 8: DICTIONARY_V2[11]
>>>     Encoding column 9: DICTIONARY_V2[7]
>>>     Encoding column 10: DIRECT_V2
>>>     Encoding column 11: DIRECT_V2
>>>     Encoding column 12: DIRECT
>>>     Encoding column 13: DIRECT
>>>     Encoding column 14: DIRECT
>>>     Encoding column 15: DIRECT_V2
>>>     Encoding column 16: DIRECT
>>>     Encoding column 17: DIRECT_V2
>>>     Encoding column 18: DIRECT_V2
>>>     Encoding column 19: DICTIONARY_V2[1]
>>>
>>> File length: 914171 bytes
>>> Padding length: 0 bytes
>>> Padding ratio: 0%
>>>
>>>
>>> I also noticed something :
>>> If I load a table like that :
>>> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>>> The queries done on myTable take at least twice the amount of time
>>> compared to queries done on the table loaded with hive directly.
>>> For technical reasons my pipeline is not fully migrated to use hive
>>> tables, and in a lot of places I still manually load the table and
>>> register it.
>>> I only see that problem with ORC format.
>>> Do you see a solution so I could have the same performance on loaded
>>> tables?
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>> Date: Saturday, April 16, 2016 at 4:14 AM
>>> To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <
>>> user@spark.apache.org>
>>>
>>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>>
>>> Apologies that should read
>>>
>>> *desc formatted <table_name>*
>>>
>>> Example for table dummy
>>>
>>> hive> desc formatted  dummy;
>>> OK
>>> # col_name              data_type               comment
>>> id                      int
>>> clustered               int
>>> scattered               int
>>> randomised              int
>>> random_string           varchar(50)
>>> small_vc                varchar(10)
>>> padding                 varchar(10)
>>> # Detailed Table Information
>>> Database:               test
>>> Owner:                  hduser
>>> CreateTime:             Sun Jan 31 06:09:56 GMT 2016
>>> LastAccessTime:         UNKNOWN
>>> Retention:              0
>>> Location:
>>> hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
>>> Table Type:             MANAGED_TABLE
>>> Table Parameters:
>>>         COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
>>>         numFiles                35
>>>         numRows                 100000000
>>>         orc.bloom.filter.columns        ID
>>>         orc.bloom.filter.fpp    0.05
>>>         orc.compress            SNAPPY
>>>         orc.create.index        true
>>>         orc.row.index.stride    10000
>>>         orc.stripe.size         16777216
>>>         rawDataSize             0
>>>         totalSize               5660813776
>>>         transient_lastDdlTime   1458636871
>>> # Storage Information
>>> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
>>> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>>> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>>> Compressed:             No
>>> Num Buckets:            256
>>> Bucket Columns:         [id]
>>> Sort Columns:           []
>>> Storage Desc Params:
>>>         serialization.format    1
>>> Time taken: 0.065 seconds, Fetched: 42 row(s)
>>>
>>> Store dates as dates.
>>>
>>> To convert a string to a date for a column like paymentdate (note: in the
>>> example below a UK-formatted date is converted to Hive's yyyy-MM-dd), use
>>> the following:
>>>
>>>         
>>> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
>>> AS paymentdate
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I have : 17970737 rows
>>>> I tried to do a “desc formatted statistics myTable” but I get “Error
>>>> while compiling statement: FAILED: SemanticException [Error 10001]: Table
>>>> not found statistics”
>>>> Even after doing something like : “ANALYZE TABLE myTable COMPUTE
>>>> STATISTICS FOR COLUMNS"
>>>>
>>>> Thank you for your answer.
>>>>
>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> Date: Saturday, April 16, 2016 at 12:32 AM
>>>> To: maurin lenglart <mau...@cuberonlabs.com>
>>>> Cc: "user @spark" <user@spark.apache.org>
>>>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>>>
>>>> Have you analysed statistics on the ORC table? How many rows are there?
>>>>
>>>> Also send the output of
>>>>
>>>> desc formatted statistics <TABLE_NAME>
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I am executing one query :
>>>>> “SELECT `event_date` as `event_date`,sum(`bookings`) as
>>>>> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  
>>>>> `event_date`
>>>>> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>>>>> LIMIT 20000”
>>>>>
>>>>> My table was created something like :
>>>>>   CREATE TABLE myTable (
>>>>>   bookings            DOUBLE
>>>>>   , dealviews           INT
>>>>>   )
>>>>>    STORED AS ORC or PARQUET
>>>>>      PARTITIONED BY (event_date STRING)
>>>>>
>>>>> PARQUET take 9second of cumulative CPU
>>>>> ORC take 50second of cumulative CPU.
>>>>>
>>>>> For ORC I have tried
>>>>> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
>>>>> but it didn't change anything.
>>>>>
>>>>> I am missing something, or parquet is better for this type of query?
>>>>>
>>>>> I am using spark 1.6.0 with hive 1.1.0
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


