Let me explain my architecture a little: I have one cluster with Hive and Spark. On that cluster I create my databases, create the tables and insert data into them. If I execute this query using ORC:

    self.sqlContext.sql("SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000")

it takes 15 seconds.
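As a minimal sketch (placeholder names; assuming the sqlContext here is a HiveContext on the Hive cluster), this first path amounts to:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-on-hive-cluster")  # hypothetical app name
    sqlContext = HiveContext(sc)  # resolves tables through the Hive metastore

    sqlContext.sql("use myDatabase")  # placeholder database name
    df = sqlContext.sql("""
        SELECT `event_date`,
               sum(`bookings`)  AS `bookings`,
               sum(`dealviews`) AS `dealviews`
        FROM myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000""")
    df.show()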
But then I export the tables to another cluster where I don't have Hive. There I load my tables using:

    sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')

Then this query:

    self.sqlContext.sql("SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000")

takes 50 seconds. If I do the same process using Parquet tables, the same query takes 8 seconds.

thanks

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:52 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Hang on, so it takes 15 seconds to switch the database context with HiveContext.sql("use myDatabase")?

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

The stats are only for one file in one partition. There are 17970737 rows in total. The table is not bucketed.

The problem is not inserting rows; the problem is this SQL query:

    "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000"

Benchmarks:
  * 8 seconds on a Parquet table loaded using sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
  * 50 seconds on ORC using sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
  * 15 seconds on ORC using sqlContext.sql('use myDatabase')

The use case that I have is the second and slowest benchmark. Is there something I can do to speed that up?

thanks
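To make that second (and slowest) path concrete, here is a minimal sketch of loading the exported ORC files directly and registering them as a table (the path and app name are placeholders; registerTempTable is the Spark 1.6 DataFrame method corresponding to the registerAsTable call quoted above):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-direct-load")  # hypothetical app name
    # In Spark 1.6 the ORC data source needs a HiveContext, which works even
    # without a Hive deployment (it creates a local metastore).
    sqlContext = HiveContext(sc)

    # Load the exported ORC files straight from their directory and expose them as a table.
    orc_df = sqlContext.read.format("orc").load("/path/to/mytableFiles")  # placeholder path
    orc_df.registerTempTable("myTable")

    # The Parquet variant of the same benchmark only changes the format string:
    # sqlContext.read.format("parquet").load("/path/to/mytableFiles").registerTempTable("myTable")

    result = sqlContext.sql("""
        SELECT `event_date`,
               sum(`bookings`)  AS `bookings`,
               sum(`dealviews`) AS `dealviews`
        FROM myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000""")
    result.show()

Loading the files directly bypasses the Hive metastore, so Spark only sees the raw ORC files without the table properties or statistics stored there, which is one plausible contributor to the gap reported above.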
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:22 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Hi Maurin,

Have you tried to create your table in Hive as a Parquet table? This table is pretty small, with 100K rows. Is the Hive table bucketed at all? I gather your issue at the moment is that inserting rows into the Hive table is taking longer (compared to Parquet)?

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I am using the Cloudera distribution, and when I do a "desc formatted <table>" I don't get all the table parameters. But I did a hive --orcfiledump on one random file (I replaced some of the values that may be sensitive):

    hive --orcfiledump /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
    2016-04-17 01:36:15,424 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
    Structure for /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
    File Version: 0.12 with HIVE_8732
    16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001 with {include: null, offset: 0, length: 9223372036854775807}
    Rows: 104260
    Compression: ZLIB
    Compression size: 262144
    Type: struct<my table structure; the columns used in the query have idx 10 and 13>
    Stripe Statistics:
      Stripe 1:
        Column 0: count: 104260 hasNull: false
        Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
        Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
        Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
        Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
        Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
        Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
        Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
        Column 8: count: 104260 hasNull: false min: max: XXX sum: 1657073
        Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
        Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
        Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
        Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
        Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
        Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
        Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
        Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
        Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
        Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
        Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600

    File Statistics:
      Column 0: count: 104260 hasNull: false
      Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
      Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
      Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
      Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
      Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
      Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
      Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
      Column 8: count: 104260 hasNull: false min: max: Travel sum: 1657073
      Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
      Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
      Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
      Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
      Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
      Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
      Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
      Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
      Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
      Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
      Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600

    Stripes:
      Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
        Stream: column 0 section ROW_INDEX start: 3 length 21
        Stream: column 1 section ROW_INDEX start: 24 length 148
        Stream: column 2 section ROW_INDEX start: 172 length 160
        Stream: column 3 section ROW_INDEX start: 332 length 168
        Stream: column 4 section ROW_INDEX start: 500 length 133
        Stream: column 5 section ROW_INDEX start: 633 length 152
        Stream: column 6 section ROW_INDEX start: 785 length 141
        Stream: column 7 section ROW_INDEX start: 926 length 165
        Stream: column 8 section ROW_INDEX start: 1091 length 150
        Stream: column 9 section ROW_INDEX start: 1241 length 160
        Stream: column 10 section ROW_INDEX start: 1401 length 205
        Stream: column 11 section ROW_INDEX start: 1606 length 200
        Stream: column 12 section ROW_INDEX start: 1806 length 201
        Stream: column 13 section ROW_INDEX start: 2007 length 292
        Stream: column 14 section ROW_INDEX start: 2299 length 375
        Stream: column 15 section ROW_INDEX start: 2674 length 191
        Stream: column 16 section ROW_INDEX start: 2865 length 295
        Stream: column 17 section ROW_INDEX start: 3160 length 194
        Stream: column 18 section ROW_INDEX start: 3354 length 192
        Stream: column 19 section ROW_INDEX start: 3546 length 122
        Stream: column 1 section DATA start: 3668 length 33586
        Stream: column 1 section LENGTH start: 37254 length 7
        Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
        Stream: column 2 section DATA start: 37275 length 40616
        Stream: column 2 section LENGTH start: 77891 length 8
        Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
        Stream: column 3 section DATA start: 77931 length 46120
        Stream: column 3 section LENGTH start: 124051 length 17
        Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
        Stream: column 4 section DATA start: 124167 length 26498
        Stream: column 5 section DATA start: 150665 length 38409
        Stream: column 5 section LENGTH start: 189074 length 8
        Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
        Stream: column 6 section DATA start: 189112 length 9425
        Stream: column 6 section LENGTH start: 198537 length 9
        Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
        Stream: column 7 section DATA start: 198582 length 95465
        Stream: column 7 section LENGTH start: 294047 length 127
        Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
        Stream: column 8 section DATA start: 295304 length 43896
        Stream: column 8 section LENGTH start: 339200 length 16
        Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
        Stream: column 9 section DATA start: 339323 length 42274
        Stream: column 9 section LENGTH start: 381597 length 9
        Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
        Stream: column 10 section DATA start: 381661 length 89206
        Stream: column 11 section DATA start: 470867 length 46613
        Stream: column 12 section DATA start: 517480 length 62630
        Stream: column 13 section DATA start: 580110 length 103241
        Stream: column 14 section DATA start: 683351 length 138479
        Stream: column 15 section DATA start: 821830 length 34983
        Stream: column 16 section DATA start: 856813 length 7917
        Stream: column 17 section DATA start: 864730 length 47435
        Stream: column 18 section DATA start: 912165 length 462
        Stream: column 19 section DATA start: 912627 length 140
        Stream: column 19 section LENGTH start: 912767 length 6
        Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
        Encoding column 0: DIRECT
        Encoding column 1: DICTIONARY_V2[3]
        Encoding column 2: DICTIONARY_V2[5]
        Encoding column 3: DICTIONARY_V2[12]
        Encoding column 4: DIRECT_V2
        Encoding column 5: DICTIONARY_V2[6]
        Encoding column 6: DICTIONARY_V2[8]
        Encoding column 7: DICTIONARY_V2[175]
        Encoding column 8: DICTIONARY_V2[11]
        Encoding column 9: DICTIONARY_V2[7]
        Encoding column 10: DIRECT_V2
        Encoding column 11: DIRECT_V2
        Encoding column 12: DIRECT
        Encoding column 13: DIRECT
        Encoding column 14: DIRECT
        Encoding column 15: DIRECT_V2
        Encoding column 16: DIRECT
        Encoding column 17: DIRECT_V2
        Encoding column 18: DIRECT_V2
        Encoding column 19: DICTIONARY_V2[1]

    File length: 914171 bytes
    Padding length: 0 bytes
    Padding ratio: 0%

I also noticed something: if I load a table like this:

    sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')

the queries on myTable take at least twice as long as queries on the table loaded through Hive directly. For technical reasons my pipeline is not fully migrated to Hive tables, and in a lot of places I still manually load the table and register it. I only see that problem with the ORC format. Do you see a solution so I could have the same performance on loaded tables?

thanks

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 4:14 AM
To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Apologies, that should read desc formatted <table_name>. Example for table dummy:

    hive> desc formatted dummy;
    OK
    # col_name        data_type      comment
    id                int
    clustered         int
    scattered         int
    randomised        int
    random_string     varchar(50)
    small_vc          varchar(10)
    padding           varchar(10)

    # Detailed Table Information
    Database:         test
    Owner:            hduser
    CreateTime:       Sun Jan 31 06:09:56 GMT 2016
    LastAccessTime:   UNKNOWN
    Retention:        0
    Location:         hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
    Table Type:       MANAGED_TABLE
    Table Parameters:
      COLUMN_STATS_ACCURATE     {\"BASIC_STATS\":\"true\"}
      numFiles                  35
      numRows                   100000000
      orc.bloom.filter.columns  ID
      orc.bloom.filter.fpp      0.05
      orc.compress              SNAPPY
      orc.create.index          true
      orc.row.index.stride      10000
      orc.stripe.size           16777216
      rawDataSize               0
      totalSize                 5660813776
      transient_lastDdlTime     1458636871

    # Storage Information
    SerDe Library:    org.apache.hadoop.hive.ql.io.orc.OrcSerde
    InputFormat:      org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
    OutputFormat:     org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
    Compressed:       No
    Num Buckets:      256
    Bucket Columns:   [id]
    Sort Columns:     []
    Storage Desc Params:
      serialization.format  1
    Time taken: 0.065 seconds, Fetched: 42 row(s)
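For reference, a hedged sketch of DDL that would produce ORC table parameters like the ones visible above; the table name, columns, and property values are illustrative (borrowed from the dummy example), not the thread's actual table:

    # Illustrative only: declaring an ORC table with tuning properties similar to
    # those shown in the "desc formatted dummy" output above (assumes a HiveContext).
    sqlContext.sql("""
        CREATE TABLE dummy_orc (
            id            INT,
            random_string VARCHAR(50)
        )
        CLUSTERED BY (id) INTO 256 BUCKETS
        STORED AS ORC
        TBLPROPERTIES (
            'orc.compress'             = 'SNAPPY',
            'orc.create.index'         = 'true',
            'orc.bloom.filter.columns' = 'id',
            'orc.bloom.filter.fpp'     = '0.05',
            'orc.stripe.size'          = '16777216',
            'orc.row.index.stride'     = '10000'
        )""")

These orc.* properties are the knobs that show up in the table parameters above; whether they help a given query depends on the data and the predicates.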
Store dates as dates. To convert a string to a date for a column like paymentdate (note: in the example below a UK-formatted date is converted to Hive's yyyy-MM-dd format), use the following:

    TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I have 17970737 rows. I tried to do a "desc formatted statistics myTable", but I get "Error while compiling statement: FAILED: SemanticException [Error 10001]: Table not found statistics", even after doing something like "ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS".

Thank you for your answer.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Have you analysed statistics on the ORC table? How many rows are there? Also send the output of desc formatted statistics <TABLE_NAME>.

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I am executing one query:

    "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000"

My table was created with something like:

    CREATE TABLE myTable (
        bookings  DOUBLE,
        dealviews INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC  -- or STORED AS PARQUET

Parquet takes 9 seconds of cumulative CPU; ORC takes 50 seconds of cumulative CPU.

For ORC I have tried:

    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

but it didn't change anything.

Am I missing something, or is Parquet just better for this type of query? I am using Spark 1.6.0 with Hive 1.1.0.

thanks
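Pulling the thread's suggestions together, a hedged end-to-end sketch of tuning the ORC path (assuming sqlContext is a HiveContext and myTable is the partitioned ORC table above; the ANALYZE statements use Hive syntax and, depending on the Spark version, may need to be run from the Hive CLI instead):

    # Minimal sketch, not the thread author's actual pipeline.

    # Enable ORC predicate pushdown (off by default in Spark 1.6); note the
    # lower-case key, unlike the casing quoted in the message above.
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    # For a partitioned table, statistics are gathered per partition.
    sqlContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS")
    sqlContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS")

    result = sqlContext.sql("""
        SELECT `event_date`,
               sum(`bookings`)  AS `bookings`,
               sum(`dealviews`) AS `dealviews`
        FROM myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000""")
    result.show()

None of this is guaranteed to close the gap with Parquet, but it covers the statistics and predicate-pushdown suggestions made earlier in the thread.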