Hi Maurin,

Have you tried creating your table in Hive as a Parquet table? The file you dumped is pretty small, at roughly 100K rows.

Is the Hive table bucketed at all? I gather your issue at the moment is that inserting rows into the Hive table is taking longer (compared to Parquet)?
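A minimal sketch of what I mean, from a spark-shell with a HiveContext (myTable and myTable_parquet are placeholder names, not from your setup; note that CTAS produces an unpartitioned copy, which is fine for a quick comparison):

  // create a Parquet copy of the ORC table to compare against (CTAS, so the
  // copy is unpartitioned; event_date becomes a plain column, fine for a test)
  hiveContext.sql(
    "CREATE TABLE myTable_parquet STORED AS PARQUET AS SELECT * FROM myTable")
  hiveContext.sql("SELECT COUNT(*) FROM myTable_parquet").show()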
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

> Hi,
> I am using the Cloudera distribution, and when I do a "desc formatted <table>"
> I don't get all the table parameters.
>
> But I did a hive orcfiledump on one random file (I replaced some of the
> values that may be sensitive):
>
> hive --orcfiledump /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
> 2016-04-17 01:36:15,424 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree
> module jar containing PrefixTreeCodec is not present. Continuing without it.
> Structure for /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
> File Version: 0.12 with HIVE_8732
> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-00001
> with {include: null, offset: 0, length: 9223372036854775807}
> Rows: 104260
> Compression: ZLIB
> Compression size: 262144
> Type: struct<...my table structure...> (the columns used in the query are at indexes 10 and 13)
>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 104260 hasNull: false
>     Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>     Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>     Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
>     Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>     Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>     Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>     Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>     Column 8: count: 104260 hasNull: false min: max: XXX sum: 1657073
>     Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>     Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>     Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>     Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>     Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>     Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>     Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>     Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>     Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>     Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>     Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>
> File Statistics:
>   Column 0: count: 104260 hasNull: false
>   Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>   Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>   Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
>   Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>   Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>   Column 6: count: 104260 hasNull: false min: max: XXX sum: 220591
>   Column 7: count: 104260 hasNull: false min: max: XXX sum: 1102288
>   Column 8: count: 104260 hasNull: false min: max: Travel sum: 1657073
>   Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>   Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>   Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>   Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
>   Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 6958348.122699987
>   Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 sum: 1546639.6964531767
>   Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>   Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 45711.881143417035
>   Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>   Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>   Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 1042600
>
> Stripes:
>   Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
>     Stream: column 0 section ROW_INDEX start: 3 length 21
>     Stream: column 1 section ROW_INDEX start: 24 length 148
>     Stream: column 2 section ROW_INDEX start: 172 length 160
>     Stream: column 3 section ROW_INDEX start: 332 length 168
>     Stream: column 4 section ROW_INDEX start: 500 length 133
>     Stream: column 5 section ROW_INDEX start: 633 length 152
>     Stream: column 6 section ROW_INDEX start: 785 length 141
>     Stream: column 7 section ROW_INDEX start: 926 length 165
>     Stream: column 8 section ROW_INDEX start: 1091 length 150
>     Stream: column 9 section ROW_INDEX start: 1241 length 160
>     Stream: column 10 section ROW_INDEX start: 1401 length 205
>     Stream: column 11 section ROW_INDEX start: 1606 length 200
>     Stream: column 12 section ROW_INDEX start: 1806 length 201
>     Stream: column 13 section ROW_INDEX start: 2007 length 292
>     Stream: column 14 section ROW_INDEX start: 2299 length 375
>     Stream: column 15 section ROW_INDEX start: 2674 length 191
>     Stream: column 16 section ROW_INDEX start: 2865 length 295
>     Stream: column 17 section ROW_INDEX start: 3160 length 194
>     Stream: column 18 section ROW_INDEX start: 3354 length 192
>     Stream: column 19 section ROW_INDEX start: 3546 length 122
>     Stream: column 1 section DATA start: 3668 length 33586
>     Stream: column 1 section LENGTH start: 37254 length 7
>     Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
>     Stream: column 2 section DATA start: 37275 length 40616
>     Stream: column 2 section LENGTH start: 77891 length 8
>     Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
>     Stream: column 3 section DATA start: 77931 length 46120
>     Stream: column 3 section LENGTH start: 124051 length 17
>     Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
>     Stream: column 4 section DATA start: 124167 length 26498
>     Stream: column 5 section DATA start: 150665 length 38409
>     Stream: column 5 section LENGTH start: 189074 length 8
>     Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
>     Stream: column 6 section DATA start: 189112 length 9425
>     Stream: column 6 section LENGTH start: 198537 length 9
>     Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
>     Stream: column 7 section DATA start: 198582 length 95465
>     Stream: column 7 section LENGTH start: 294047 length 127
>     Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
>     Stream: column 8 section DATA start: 295304 length 43896
>     Stream: column 8 section LENGTH start: 339200 length 16
>     Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
>     Stream: column 9 section DATA start: 339323 length 42274
>     Stream: column 9 section LENGTH start: 381597 length 9
>     Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
>     Stream: column 10 section DATA start: 381661 length 89206
>     Stream: column 11 section DATA start: 470867 length 46613
>     Stream: column 12 section DATA start: 517480 length 62630
>     Stream: column 13 section DATA start: 580110 length 103241
>     Stream: column 14 section DATA start: 683351 length 138479
>     Stream: column 15 section DATA start: 821830 length 34983
>     Stream: column 16 section DATA start: 856813 length 7917
>     Stream: column 17 section DATA start: 864730 length 47435
>     Stream: column 18 section DATA start: 912165 length 462
>     Stream: column 19 section DATA start: 912627 length 140
>     Stream: column 19 section LENGTH start: 912767 length 6
>     Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
>     Encoding column 0: DIRECT
>     Encoding column 1: DICTIONARY_V2[3]
>     Encoding column 2: DICTIONARY_V2[5]
>     Encoding column 3: DICTIONARY_V2[12]
>     Encoding column 4: DIRECT_V2
>     Encoding column 5: DICTIONARY_V2[6]
>     Encoding column 6: DICTIONARY_V2[8]
>     Encoding column 7: DICTIONARY_V2[175]
>     Encoding column 8: DICTIONARY_V2[11]
>     Encoding column 9: DICTIONARY_V2[7]
>     Encoding column 10: DIRECT_V2
>     Encoding column 11: DIRECT_V2
>     Encoding column 12: DIRECT
>     Encoding column 13: DIRECT
>     Encoding column 14: DIRECT
>     Encoding column 15: DIRECT_V2
>     Encoding column 16: DIRECT
>     Encoding column 17: DIRECT_V2
>     Encoding column 18: DIRECT_V2
>     Encoding column 19: DICTIONARY_V2[1]
>
> File length: 914171 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
>
> I also noticed something:
> if I load a table like this:
>
> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>
> the queries done on myTable take at least twice the amount of time
> compared to queries done on the table loaded with Hive directly.
> For technical reasons my pipeline is not fully migrated to use Hive
> tables, and in a lot of places I still manually load the table and
> register it.
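For what it's worth, the two code paths are not equivalent: reading the files directly rediscovers the schema from the ORC footers, while going through the metastore reuses the table and partition definitions already in the catalog. A sketch of both, assuming Spark 1.6 and placeholder paths/names (registerTempTable is the 1.6 name for this call):

  // plan A: read the raw ORC files; the schema comes from the file footers and
  // partitions are rediscovered from the event_date=... directory layout
  sqlContext.read.format("orc")
    .load("/user/hive/warehouse/myDB.db/mytable")
    .registerTempTable("myTable_files")

  // plan B: go through the Hive metastore; table and partition metadata
  // (and any gathered statistics) are already known
  hiveContext.table("myDB.mytable").registerTempTable("myTable_hive")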
> I only see that problem with the ORC format.
> Do you see a solution so I could have the same performance on loaded
> tables?
>
> thanks
>
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Saturday, April 16, 2016 at 4:14 AM
> To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Apologies, that should read:
>
> desc formatted <table_name>
>
> Example for table dummy:
>
> hive> desc formatted dummy;
> OK
> # col_name        data_type      comment
> id                int
> clustered         int
> scattered         int
> randomised        int
> random_string     varchar(50)
> small_vc          varchar(10)
> padding           varchar(10)
>
> # Detailed Table Information
> Database:        test
> Owner:           hduser
> CreateTime:      Sun Jan 31 06:09:56 GMT 2016
> LastAccessTime:  UNKNOWN
> Retention:       0
> Location:        hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
> Table Type:      MANAGED_TABLE
> Table Parameters:
>   COLUMN_STATS_ACCURATE     {"BASIC_STATS":"true"}
>   numFiles                  35
>   numRows                   100000000
>   orc.bloom.filter.columns  ID
>   orc.bloom.filter.fpp      0.05
>   orc.compress              SNAPPY
>   orc.create.index          true
>   orc.row.index.stride      10000
>   orc.stripe.size           16777216
>   rawDataSize               0
>   totalSize                 5660813776
>   transient_lastDdlTime     1458636871
>
> # Storage Information
> SerDe Library:   org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:     org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:      No
> Num Buckets:     256
> Bucket Columns:  [id]
> Sort Columns:    []
> Storage Desc Params:
>   serialization.format 1
> Time taken: 0.065 seconds, Fetched: 42 row(s)
>
> Store dates as dates.
>
> To convert a string to a date for a column like paymentdate (note: below, a
> UK-formatted date is converted to Hive's yyyy-MM-dd format), use the following:
>
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I have: 17970737 rows.
>> I tried to do a "desc formatted statistics myTable" but I get "Error
>> while compiling statement: FAILED: SemanticException [Error 10001]: Table
>> not found statistics".
>> That happens even after doing something like "ANALYZE TABLE myTable
>> COMPUTE STATISTICS FOR COLUMNS".
>>
>> Thank you for your answer.
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Saturday, April 16, 2016 at 12:32 AM
>> To: maurin lenglart <mau...@cuberonlabs.com>
>> Cc: "user @spark" <user@spark.apache.org>
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Have you analysed statistics on the ORC table? How many rows are there?
>>
>> Also send the output of
>>
>> desc formatted statistics <TABLE_NAME>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
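As an aside, gathering statistics on a table partitioned by event_date would look roughly like this (a sketch using the names from this thread; whether these statements pass through from a HiveContext or need to be run in the hive CLI depends on the Hive and Spark versions involved):

  // table/partition-level statistics (numRows, rawDataSize, ...) for every partition
  hiveContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS")
  // column-level statistics used by the optimizer
  hiveContext.sql("ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS")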
>> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>>
>>> Hi,
>>> I am executing this query:
>>>
>>> SELECT `event_date` AS `event_date`, sum(`bookings`) AS `bookings`,
>>> sum(`dealviews`) AS `dealviews` FROM myTable WHERE `event_date` >=
>>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>>> LIMIT 20000
>>>
>>> My table was created with something like:
>>>
>>> CREATE TABLE myTable (
>>>   bookings  DOUBLE,
>>>   dealviews INT
>>> )
>>> PARTITIONED BY (event_date STRING)
>>> STORED AS ORC  -- or PARQUET
>>>
>>> Parquet takes 9 seconds of cumulative CPU;
>>> ORC takes 50 seconds of cumulative CPU.
>>>
>>> For ORC I have tried
>>> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
>>> but it didn't change anything.
>>>
>>> Am I missing something, or is Parquet better for this type of query?
>>>
>>> I am using Spark 1.6.0 with Hive 1.1.0.
>>>
>>> thanks
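Putting the thread together, a self-contained sketch of the comparison being discussed, assuming a Spark 1.6 spark-shell with a HiveContext and placeholder table names myTable_orc and myTable_parquet:

  // enable ORC predicate pushdown before running the query
  hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

  val query = """SELECT `event_date`, sum(`bookings`) AS `bookings`,
                 sum(`dealviews`) AS `dealviews` FROM %s
                 WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
                 GROUP BY `event_date` LIMIT 20000"""

  // run the same aggregation against both copies and report wall-clock time
  for (t <- Seq("myTable_orc", "myTable_parquet")) {
    val start = System.nanoTime
    hiveContext.sql(query.format(t)).collect()
    println(s"$t: ${(System.nanoTime - start) / 1e9} s")
  }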