Let me explain my architecture a little: I have one cluster with Hive and Spark. On that cluster I create my databases, create the tables and insert data into them. If I execute this query using ORC:

    self.sqlContext.sql("SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000")

it takes 15 seconds.
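As a minimal sketch (placeholder names; assuming the sqlContext here is a HiveContext on the Hive cluster), this first path amounts to:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-on-hive-cluster")  # hypothetical app name
    sqlContext = HiveContext(sc)  # resolves tables through the Hive metastore

    sqlContext.sql("use myDatabase")  # placeholder database name
    df = sqlContext.sql("""
        SELECT `event_date`,
               sum(`bookings`)  AS `bookings`,
               sum(`dealviews`) AS `dealviews`
        FROM myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000""")
    df.show()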
But then I export the tables to another cluster where I don't have Hive. There I load my tables using:

    sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')

Then this query:

    self.sqlContext.sql("SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000")

takes 50 seconds. If I do the same process using Parquet tables, the same query takes 8 seconds.

thanks

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:52 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Hang on, so it takes 15 seconds to switch the database context with HiveContext.sql("use myDatabase")?

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

The stats are only for one file in one partition. There are 17970737 rows in total. The table is not bucketed.

The problem is not inserting rows; the problem is this SQL query:

    "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000"

Benchmarks:
  * 8 seconds on a Parquet table loaded using sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
  * 50 seconds on ORC using sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
  * 15 seconds on ORC using sqlContext.sql('use myDatabase')

The use case that I have is the second and slowest benchmark. Is there something I can do to speed that up?

thanks
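To make that second (and slowest) path concrete, here is a minimal sketch of loading the exported ORC files directly and registering them as a table (the path and app name are placeholders; registerTempTable is the Spark 1.6 DataFrame method corresponding to the registerAsTable call quoted above):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-direct-load")  # hypothetical app name
    # In Spark 1.6 the ORC data source needs a HiveContext, which works even
    # without a Hive deployment (it creates a local metastore).
    sqlContext = HiveContext(sc)

    # Load the exported ORC files straight from their directory and expose them as a table.
    orc_df = sqlContext.read.format("orc").load("/path/to/mytableFiles")  # placeholder path
    orc_df.registerTempTable("myTable")

    # The Parquet variant of the same benchmark only changes the format string:
    # sqlContext.read.format("parquet").load("/path/to/mytableFiles").registerTempTable("myTable")

    result = sqlContext.sql("""
        SELECT `event_date`,
               sum(`bookings`)  AS `bookings`,
               sum(`dealviews`) AS `dealviews`
        FROM myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000""")
    result.show()

Loading the files directly bypasses the Hive metastore, so Spark only sees the raw ORC files without the table properties or statistics stored there, which is one plausible contributor to the gap reported above.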
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:22 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Hi Maurin,

Have you tried to create your table in Hive as a Parquet table? This table is pretty small, with 100K rows. Is the Hive table bucketed at all? I gather your issue at the moment is that inserting rows into the Hive table is taking longer (compared to Parquet)?

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I am using the Cloudera distribution, and when I do a "desc formatted <table>" I don't get all the table parameters. But I did a hive --orcfiledump on one random file (I replaced some of the values that may be sensitive):

    hive --orcfiledump /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
    2016-04-17 01:36:15,424 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
    Structure for /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
    File Version: 0.12 with HIVE_8732
    16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001 with {include: null, offset: 0, length: 9223372036854775807}
    Rows: 104260
    Compression: ZLIB
    Compression size: 262144
    Type: struct<my table structure; the columns used in the query have idx 10 and 13>
    Stripe Statistics:
      Stripe 1:
        Column 0: count: 104260 hasNull: false
        Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
        Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
        Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
        Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
        Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
        Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
        Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
        Column 8: count: 104260 hasNull: false min: max: XXX sum: 1657073
        Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
        Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
        Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
        Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
        Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
        Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
        Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
        Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
        Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
        Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
        Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600

    File Statistics:
      Column 0: count: 104260 hasNull: false
      Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
      Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
      Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
      Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
      Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
      Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
      Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
      Column 8: count: 104260 hasNull: false min: max: Travel sum: 1657073
      Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
      Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
      Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
      Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
      Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
      Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
      Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
      Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
      Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
      Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
      Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600

    Stripes:
      Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
        Stream: column 0 section ROW_INDEX start: 3 length 21
        Stream: column 1 section ROW_INDEX start: 24 length 148
        Stream: column 2 section ROW_INDEX start: 172 length 160
        Stream: column 3 section ROW_INDEX start: 332 length 168
        Stream: column 4 section ROW_INDEX start: 500 length 133
        Stream: column 5 section ROW_INDEX start: 633 length 152
        Stream: column 6 section ROW_INDEX start: 785 length 141
        Stream: column 7 section ROW_INDEX start: 926 length 165
        Stream: column 8 section ROW_INDEX start: 1091 length 150
        Stream: column 9 section ROW_INDEX start: 1241 length 160
        Stream: column 10 section ROW_INDEX start: 1401 length 205
        Stream: column 11 section ROW_INDEX start: 1606 length 200
        Stream: column 12 section ROW_INDEX start: 1806 length 201
        Stream: column 13 section ROW_INDEX start: 2007 length 292
        Stream: column 14 section ROW_INDEX start: 2299 length 375
        Stream: column 15 section ROW_INDEX start: 2674 length 191
        Stream: column 16 section ROW_INDEX start: 2865 length 295
        Stream: column 17 section ROW_INDEX start: 3160 length 194
        Stream: column 18 section ROW_INDEX start: 3354 length 192
        Stream: column 19 section ROW_INDEX start: 3546 length 122
        Stream: column 1 section DATA start: 3668 length 33586
        Stream: column 1 section LENGTH start: 37254 length 7
        Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
        Stream: column 2 section DATA start: 37275 length 40616
        Stream: column 2 section LENGTH start: 77891 length 8
        Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
        Stream: column 3 section DATA start: 77931 length 46120
        Stream: column 3 section LENGTH start: 124051 length 17
        Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
        Stream: column 4 section DATA start: 124167 length 26498
        Stream: column 5 section DATA start: 150665 length 38409
        Stream: column 5 section LENGTH start: 189074 length 8
        Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
        Stream: column 6 section DATA start: 189112 length 9425
        Stream: column 6 section LENGTH start: 198537 length 9
        Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
        Stream: column 7 section DATA start: 198582 length 95465
        Stream: column 7 section LENGTH start: 294047 length 127
        Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
        Stream: column 8 section DATA start: 295304 length 43896
        Stream: column 8 section LENGTH start: 339200 length 16
        Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
        Stream: column 9 section DATA start: 339323 length 42274
        Stream: column 9 section LENGTH start: 381597 length 9
        Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
        Stream: column 10 section DATA start: 381661 length 89206
        Stream: column 11 section DATA start: 470867 length 46613
        Stream: column 12 section DATA start: 517480 length 62630
        Stream: column 13 section DATA start: 580110 length 103241
        Stream: column 14 section DATA start: 683351 length 138479
        Stream: column 15 section DATA start: 821830 length 34983
        Stream: column 16 section DATA start: 856813 length 7917
        Stream: column 17 section DATA start: 864730 length 47435
        Stream: column 18 section DATA start: 912165 length 462
        Stream: column 19 section DATA start: 912627 length 140
        Stream: column 19 section LENGTH start: 912767 length 6
        Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
        Encoding column 0: DIRECT
        Encoding column 1: DICTIONARY_V2[3]
        Encoding column 2: DICTIONARY_V2[5]
        Encoding column 3: DICTIONARY_V2[12]
        Encoding column 4: DIRECT_V2
        Encoding column 5: DICTIONARY_V2[6]
        Encoding column 6: DICTIONARY_V2[8]
        Encoding column 7: DICTIONARY_V2[175]
        Encoding column 8: DICTIONARY_V2[11]
        Encoding column 9: DICTIONARY_V2[7]
        Encoding column 10: DIRECT_V2
        Encoding column 11: DIRECT_V2
        Encoding column 12: DIRECT
        Encoding column 13: DIRECT
        Encoding column 14: DIRECT
        Encoding column 15: DIRECT_V2
        Encoding column 16: DIRECT
        Encoding column 17: DIRECT_V2
        Encoding column 18: DIRECT_V2
        Encoding column 19: DICTIONARY_V2[1]

    File length: 914171 bytes
    Padding length: 0 bytes
    Padding ratio: 0%

I also noticed something: if I load a table like this:

    sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')

the queries on myTable take at least twice as long as queries on the table loaded through Hive directly. For technical reasons my pipeline is not fully migrated to Hive tables, and in a lot of places I still manually load the table and register it. I only see that problem with the ORC format. Do you see a solution so I could have the same performance on loaded tables?

thanks

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 4:14 AM
To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Apologies, that should read desc formatted <table_name>. Example for table dummy:

    hive> desc formatted dummy;
    OK
    # col_name        data_type      comment
    id                int
    clustered         int
    scattered         int
    randomised        int
    random_string     varchar(50)
    small_vc          varchar(10)
    padding           varchar(10)

    # Detailed Table Information
    Database:         test
    Owner:            hduser
    CreateTime:       Sun Jan 31 06:09:56 GMT 2016
    LastAccessTime:   UNKNOWN
    Retention:        0
    Location:         hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
    Table Type:       MANAGED_TABLE
    Table Parameters:
      COLUMN_STATS_ACCURATE     {\"BASIC_STATS\":\"true\"}
      numFiles                  35
      numRows                   100000000
      orc.bloom.filter.columns  ID
      orc.bloom.filter.fpp      0.05
      orc.compress              SNAPPY
      orc.create.index          true
      orc.row.index.stride      10000
      orc.stripe.size           16777216
      rawDataSize               0
      totalSize                 5660813776
      transient_lastDdlTime     1458636871

    # Storage Information
    SerDe Library:    org.apache.hadoop.hive.ql.io.orc.OrcSerde
    InputFormat:      org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
    OutputFormat:     org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
    Compressed:       No
    Num Buckets:      256
    Bucket Columns:   [id]
    Sort Columns:     []
    Storage Desc Params:
      serialization.format  1
    Time taken: 0.065 seconds, Fetched: 42 row(s)
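For reference, a hedged sketch of DDL that would produce ORC table parameters like the ones visible above; the table name, columns, and property values are illustrative (borrowed from the dummy example), not the thread's actual table:

    # Illustrative only: declaring an ORC table with tuning properties similar to
    # those shown in the "desc formatted dummy" output above (assumes a HiveContext).
    sqlContext.sql("""
        CREATE TABLE dummy_orc (
            id            INT,
            random_string VARCHAR(50)
        )
        CLUSTERED BY (id) INTO 256 BUCKETS
        STORED AS ORC
        TBLPROPERTIES (
            'orc.compress'             = 'SNAPPY',
            'orc.create.index'         = 'true',
            'orc.bloom.filter.columns' = 'id',
            'orc.bloom.filter.fpp'     = '0.05',
            'orc.stripe.size'          = '16777216',
            'orc.row.index.stride'     = '10000'
        )""")

These orc.* properties are the knobs that show up in the table parameters above; whether they help a given query depends on the data and the predicates.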
Store dates as dates. To convert a string to a date for a column like paymentdate (note: in the example below a UK-formatted date is converted to Hive's yyyy-MM-dd format), use the following:

    TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I have 17970737 rows. I tried to do a "desc formatted statistics myTable", but I get "Error while compiling statement: FAILED: SemanticException [Error 10001]: Table not found statistics", even after doing something like "ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS".

Thank you for your answer.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Have you analysed statistics on the ORC table? How many rows are there? Also send the output of desc formatted statistics <TABLE_NAME>.

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I am executing one query:

    "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000"

My table was created with something like:

    CREATE TABLE myTable (
        bookings  DOUBLE,
        dealviews INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC  -- or STORED AS PARQUET

Parquet takes 9 seconds of cumulative CPU; ORC takes 50 seconds of cumulative CPU.

For ORC I have tried:

    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

but it didn't change anything.

Am I missing something, or is Parquet just better for this type of query? I am using Spark 1.6.0 with Hive 1.1.0.

thanks
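Pulling the thread's suggestions together, a hedged end-to-end sketch of tuning the ORC path (assuming sqlContext is a HiveContext and myTable is the partitioned ORC table above; the ANALYZE statements use Hive syntax and, depending on the Spark version, may need to be run from the Hive CLI instead):

    # Minimal sketch, not the thread author's actual pipeline.

    # Enable ORC predicate pushdown (off by default in Spark 1.6); note the
    # lower-case key, unlike the casing quoted in the message above.
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    # For a partitioned table, statistics are gathered per partition.
    sqlContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS")
    sqlContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS")

    result = sqlContext.sql("""
        SELECT `event_date`,
               sum(`bookings`)  AS `bookings`,
               sum(`dealviews`) AS `dealviews`
        FROM myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000""")
    result.show()

None of this is guaranteed to close the gap with Parquet, but it covers the statistics and predicate-pushdown suggestions made earlier in the thread.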