Hang on, so it takes 15 seconds to switch the database context with HiveContext.sql("use myDatabase")?
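For readers skimming the thread, a minimal PySpark sketch of the three registration paths being compared in the benchmarks quoted below (Spark 1.6 APIs; the file paths and database/table names are the placeholders used in the thread, and the timings in the comments are the figures Maurin reports):

    # Sketch only (Spark 1.6 style); paths and names are placeholders from the thread.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-vs-parquet-bench")
    # ORC as a data source needs Hive support in Spark 1.x, so use HiveContext throughout.
    hiveContext = HiveContext(sc)

    query = """SELECT `event_date`, sum(`bookings`) AS `bookings`,
                      sum(`dealviews`) AS `dealviews`
               FROM myTable
               WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
               GROUP BY `event_date` LIMIT 20000"""

    # 1) Parquet files registered as a temporary table (~8 s reported below).
    hiveContext.read.format("parquet").load("mytableFiles").registerTempTable("myTable")
    hiveContext.sql(query).collect()

    # 2) ORC files registered the same way (~50 s reported below).
    hiveContext.dropTempTable("myTable")
    hiveContext.read.format("orc").load("mytableFiles").registerTempTable("myTable")
    hiveContext.sql(query).collect()

    # 3) The Hive-managed ORC table read through the metastore (~15 s reported below).
    hiveContext.dropTempTable("myTable")  # so the temp table no longer shadows the Hive table
    hiveContext.sql("use myDatabase")
    hiveContext.sql(query).collect()

(The thread uses the older registerAsTable name; registerTempTable is the Spark 1.6 equivalent.)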
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

> The stats are only for one file in one partition. There are 17970737 rows
> in total.
> The table is not bucketed.
>
> The problem is not inserting rows; the problem is with this SQL query:
>
> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
> LIMIT 20000"
>
> Benchmarks:
>
>   - 8 seconds on the Parquet table loaded using
>     sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
>   - 50 seconds on ORC using
>     sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>   - 15 seconds on ORC using sqlContext('use myDatabase')
>
> The use case that I have is the second and slowest benchmark. Is there
> something I can do to speed that up?
>
> thanks
>
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Sunday, April 17, 2016 at 2:22 PM
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Hi Maurin,
>
> Have you tried creating your table in Hive as a Parquet table? This table
> is pretty small, with 100K rows.
>
> Is the Hive table bucketed at all? I gather your issue at the moment is
> that inserting rows into the Hive table is taking longer (compared to
> Parquet)?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I am using the Cloudera distribution, and when I do a "desc formatted table"
>> I don't get all the table parameters.
>>
>> But I did a hive orcfiledump on one random file (I replaced some of the
>> values that may be sensitive):
>>
>> hive --orcfiledump /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>> 2016-04-17 01:36:15,424 WARN [main] mapreduce.TableMapReduceUtil: The
>> hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
>> Continuing without it.
>> Structure for /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>> File Version: 0.12 with HIVE_8732
>> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
>> with {include: null, offset: 0, length: 9223372036854775807}
>> Rows: 104260
>> Compression: ZLIB
>> Compression size: 262144
>> Type: struct<my table structure, the columns in the query have the idx 10 and 13 and >
>>
>> Stripe Statistics:
>>   Stripe 1:
>>     Column 0: count: 104260 hasNull: false
>>     Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>>     Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>>     Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
>>     Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>>     Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>>     Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>>     Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>>     Column 8: count: 104260 hasNull: false min: max: XXX sum: 1657073
>>     Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>>     Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>>     Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>>     Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>>     Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>>     Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>>     Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>>     Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>>     Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>>     Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>>     Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>>
>> File Statistics:
>>   Column 0: count: 104260 hasNull: false
>>   Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>>   Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>>   Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
>>   Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>>   Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>>   Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>>   Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>>   Column 8: count: 104260 hasNull: false min: max: Travel sum: 1657073
>>   Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>>   Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>>   Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>>   Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>>   Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>>   Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>>   Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>>   Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>>   Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>>   Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>>   Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>>
>> Stripes:
>>   Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
>>     Stream: column 0 section ROW_INDEX start: 3 length 21
>>     Stream: column 1 section ROW_INDEX start: 24 length 148
>>     Stream: column 2 section ROW_INDEX start: 172 length 160
>>     Stream: column 3 section ROW_INDEX start: 332 length 168
>>     Stream: column 4 section ROW_INDEX start: 500 length 133
>>     Stream: column 5 section ROW_INDEX start: 633 length 152
>>     Stream: column 6 section ROW_INDEX start: 785 length 141
>>     Stream: column 7 section ROW_INDEX start: 926 length 165
>>     Stream: column 8 section ROW_INDEX start: 1091 length 150
>>     Stream: column 9 section ROW_INDEX start: 1241 length 160
>>     Stream: column 10 section ROW_INDEX start: 1401 length 205
>>     Stream: column 11 section ROW_INDEX start: 1606 length 200
>>     Stream: column 12 section ROW_INDEX start: 1806 length 201
>>     Stream: column 13 section ROW_INDEX start: 2007 length 292
>>     Stream: column 14 section ROW_INDEX start: 2299 length 375
>>     Stream: column 15 section ROW_INDEX start: 2674 length 191
>>     Stream: column 16 section ROW_INDEX start: 2865 length 295
>>     Stream: column 17 section ROW_INDEX start: 3160 length 194
>>     Stream: column 18 section ROW_INDEX start: 3354 length 192
>>     Stream: column 19 section ROW_INDEX start: 3546 length 122
>>     Stream: column 1 section DATA start: 3668 length 33586
>>     Stream: column 1 section LENGTH start: 37254 length 7
>>     Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
>>     Stream: column 2 section DATA start: 37275 length 40616
>>     Stream: column 2 section LENGTH start: 77891 length 8
>>     Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
>>     Stream: column 3 section DATA start: 77931 length 46120
>>     Stream: column 3 section LENGTH start: 124051 length 17
>>     Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
>>     Stream: column 4 section DATA start: 124167 length 26498
>>     Stream: column 5 section DATA start: 150665 length 38409
>>     Stream: column 5 section LENGTH start: 189074 length 8
>>     Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
>>     Stream: column 6 section DATA start: 189112 length 9425
>>     Stream: column 6 section LENGTH start: 198537 length 9
>>     Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
>>     Stream: column 7 section DATA start: 198582 length 95465
>>     Stream: column 7 section LENGTH start: 294047 length 127
>>     Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
>>     Stream: column 8 section DATA start: 295304 length 43896
>>     Stream: column 8 section LENGTH start: 339200 length 16
>>     Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
>>     Stream: column 9 section DATA start: 339323 length 42274
>>     Stream: column 9 section LENGTH start: 381597 length 9
>>     Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
>>     Stream: column 10 section DATA start: 381661 length 89206
>>     Stream: column 11 section DATA start: 470867 length 46613
>>     Stream: column 12 section DATA start: 517480 length 62630
>>     Stream: column 13 section DATA start: 580110 length 103241
>>     Stream: column 14 section DATA start: 683351 length 138479
>>     Stream: column 15 section DATA start: 821830 length 34983
>>     Stream: column 16 section DATA start: 856813 length 7917
>>     Stream: column 17 section DATA start: 864730 length 47435
>>     Stream: column 18 section DATA start: 912165 length 462
>>     Stream: column 19 section DATA start: 912627 length 140
>>     Stream: column 19 section LENGTH start: 912767 length 6
>>     Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
>>     Encoding column 0: DIRECT
>>     Encoding column 1: DICTIONARY_V2[3]
>>     Encoding column 2: DICTIONARY_V2[5]
>>     Encoding column 3: DICTIONARY_V2[12]
>>     Encoding column 4: DIRECT_V2
>>     Encoding column 5: DICTIONARY_V2[6]
>>     Encoding column 6: DICTIONARY_V2[8]
>>     Encoding column 7: DICTIONARY_V2[175]
>>     Encoding column 8: DICTIONARY_V2[11]
>>     Encoding column 9: DICTIONARY_V2[7]
>>     Encoding column 10: DIRECT_V2
>>     Encoding column 11: DIRECT_V2
>>     Encoding column 12: DIRECT
>>     Encoding column 13: DIRECT
>>     Encoding column 14: DIRECT
>>     Encoding column 15: DIRECT_V2
>>     Encoding column 16: DIRECT
>>     Encoding column 17: DIRECT_V2
>>     Encoding column 18: DIRECT_V2
>>     Encoding column 19: DICTIONARY_V2[1]
>>
>> File length: 914171 bytes
>> Padding length: 0 bytes
>> Padding ratio: 0%
>>
>> I also noticed something:
>> If I load a table like this:
>>
>> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>>
>> the queries done on myTable take at least twice the amount of time
>> compared to queries done on the table read through Hive directly.
>> For technical reasons my pipeline is not fully migrated to use Hive
>> tables, and in a lot of places I still manually load the table and
>> register it.
>> I only see that problem with the ORC format.
>> Do you see a solution so I could have the same performance on loaded
>> tables?
>>
>> thanks
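(Not something confirmed in this thread, just a sketch of one possible workaround for the loaded-table slowdown described above: instead of re-loading the ORC files, take the DataFrame straight from the Hive metastore table and register that, so the pipeline stays programmatic but queries still go through the Hive table definition.)

    # Assumes an existing SparkContext `sc`; database/table names are the ones used above.
    from pyspark.sql import HiveContext

    hiveContext = HiveContext(sc)
    df = hiveContext.table("myDatabase.myTable")  # metastore-backed DataFrame, no manual file load
    df.registerTempTable("myTable")               # downstream code can keep querying "myTable"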
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Saturday, April 16, 2016 at 4:14 AM
>> To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org>
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Apologies, that should read
>>
>> desc formatted <table_name>
>>
>> Example for table dummy:
>>
>> hive> desc formatted dummy;
>> OK
>> # col_name              data_type               comment
>> id                      int
>> clustered               int
>> scattered               int
>> randomised              int
>> random_string           varchar(50)
>> small_vc                varchar(10)
>> padding                 varchar(10)
>>
>> # Detailed Table Information
>> Database:               test
>> Owner:                  hduser
>> CreateTime:             Sun Jan 31 06:09:56 GMT 2016
>> LastAccessTime:         UNKNOWN
>> Retention:              0
>> Location:               hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
>> Table Type:             MANAGED_TABLE
>> Table Parameters:
>>         COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
>>         numFiles                35
>>         numRows                 100000000
>>         orc.bloom.filter.columns        ID
>>         orc.bloom.filter.fpp    0.05
>>         orc.compress            SNAPPY
>>         orc.create.index        true
>>         orc.row.index.stride    10000
>>         orc.stripe.size         16777216
>>         rawDataSize             0
>>         totalSize               5660813776
>>         transient_lastDdlTime   1458636871
>>
>> # Storage Information
>> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
>> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>> Compressed:             No
>> Num Buckets:            256
>> Bucket Columns:         [id]
>> Sort Columns:           []
>> Storage Desc Params:
>>         serialization.format    1
>> Time taken: 0.065 seconds, Fetched: 42 row(s)
>>
>> Store dates as dates.
>>
>> To convert a string to a date for a column like paymentdate (note: below, a
>> UK-formatted date is converted to Hive's yyyy-MM-dd), use the following:
>>
>> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>>
>>> Hi,
>>> I have: 17970737 rows.
>>> I tried to do a "desc formatted statistics myTable" but I get "Error
>>> while compiling statement: FAILED: SemanticException [Error 10001]: Table
>>> not found statistics",
>>> even after doing something like: "ANALYZE TABLE myTable COMPUTE
>>> STATISTICS FOR COLUMNS".
>>>
>>> Thank you for your answer.
>>>
>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>> Date: Saturday, April 16, 2016 at 12:32 AM
>>> To: maurin lenglart <mau...@cuberonlabs.com>
>>> Cc: "user @spark" <user@spark.apache.org>
>>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>>
>>> Have you analysed statistics on the ORC table? How many rows are there?
>>>
>>> Also send the output of
>>>
>>> desc formatted statistics <TABLE_NAME>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>>>
>>>> Hi,
>>>> I am executing one query:
>>>>
>>>> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
>>>> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
>>>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>>>> LIMIT 20000"
>>>>
>>>> My table was created with something like:
>>>>
>>>> CREATE TABLE myTable (
>>>>     bookings DOUBLE,
>>>>     dealviews INT
>>>> )
>>>> PARTITIONED BY (event_date STRING)
>>>> STORED AS ORC   -- or PARQUET
>>>>
>>>> Parquet takes 9 seconds of cumulative CPU;
>>>> ORC takes 50 seconds of cumulative CPU.
>>>>
>>>> For ORC I have tried
>>>> hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
>>>> but it didn't change anything.
>>>>
>>>> Am I missing something, or is Parquet just better for this type of query?
>>>>
>>>> I am using Spark 1.6.0 with Hive 1.1.0.
>>>>
>>>> thanks
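To close the loop, a minimal sketch in the same Spark 1.6 Python style that combines the two things suggested in this thread, enabling ORC predicate pushdown and computing table statistics, before running the query above. The lower-case config key and the PARTITION clause are assumptions on my part; on some setups the ANALYZE statements may need to be run from the Hive CLI rather than through Spark.

    # Assumes an existing SparkContext `sc`; table and column names come from the query above.
    from pyspark.sql import HiveContext

    hiveContext = HiveContext(sc)

    # The key is conventionally written spark.sql.orc.filterPushdown; it is off by default in 1.6.
    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

    # Basic and column statistics for the partitioned table
    # (may need to be run from the Hive CLI on some setups).
    hiveContext.sql("ANALYZE TABLE myDatabase.myTable PARTITION (event_date) COMPUTE STATISTICS")
    hiveContext.sql("ANALYZE TABLE myDatabase.myTable PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS")

    result = hiveContext.sql("""
        SELECT `event_date`, sum(`bookings`) AS `bookings`, sum(`dealviews`) AS `dealviews`
        FROM myDatabase.myTable
        WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
        GROUP BY `event_date`
        LIMIT 20000
    """)
    result.show()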