Apologies, that should read *desc formatted <table_name>*.
Example for table dummy:

hive> desc formatted dummy;
OK
# col_name            data_type       comment
id                    int
clustered             int
scattered             int
randomised            int
random_string         varchar(50)
small_vc              varchar(10)
padding               varchar(10)

# Detailed Table Information
Database:             test
Owner:                hduser
CreateTime:           Sun Jan 31 06:09:56 GMT 2016
LastAccessTime:       UNKNOWN
Retention:            0
Location:             hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
Table Type:           MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                35
        numRows                 100000000
        orc.bloom.filter.columns        ID
        orc.bloom.filter.fpp    0.05
        orc.compress            SNAPPY
        orc.create.index        true
        orc.row.index.stride    10000
        orc.stripe.size         16777216
        rawDataSize             0
        totalSize               5660813776
        transient_lastDdlTime   1458636871

# Storage Information
SerDe Library:        org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:          org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:           No
Num Buckets:          256
Bucket Columns:       [id]
Sort Columns:         []
Storage Desc Params:
        serialization.format    1
Time taken: 0.065 seconds, Fetched: 42 row(s)

Store dates as dates. To convert a string to a date for a column like paymentdate (note: below, a UK-formatted date is converted to Hive's yyyy-MM-dd format), use the following:

TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate

HTH

Dr Mich Talebzadeh

LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

> Hi,
> I have 17970737 rows.
> I tried to do a "desc formatted statistics myTable" but I get "Error
> while compiling statement: FAILED: SemanticException [Error 10001]: Table
> not found statistics"
> Even after doing something like: "ANALYZE TABLE myTable COMPUTE
> STATISTICS FOR COLUMNS"
>
> Thank you for your answer.
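As a quick illustration of the date-conversion tip above (a sketch only; `payments` and `payments_staging` are hypothetical table names not taken from the thread), the expression is typically applied once when loading string-typed staging data into a table with a proper DATE column:

```sql
-- Hypothetical staging table holds paymentdate as a string, e.g. '31/01/2016'.
CREATE TABLE payments (paymentdate DATE) STORED AS ORC;

-- Parse the UK-formatted string and store it as a real DATE (yyyy-MM-dd):
INSERT INTO TABLE payments
SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate, 'dd/MM/yyyy'), 'yyyy-MM-dd'))
FROM payments_staging;
```

Storing the column as DATE rather than STRING lets range predicates and partition pruning work on real date semantics instead of string comparison.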
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Saturday, April 16, 2016 at 12:32 AM
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Have you analysed statistics on the ORC table? How many rows are there?
>
> Also send the output of
>
> desc formatted statistics <TABLE_NAME>
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I am executing one query:
>>
>> SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
>> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06'
>> AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 20000
>>
>> My table was created with something like:
>>
>> CREATE TABLE myTable (
>>   bookings DOUBLE,
>>   dealviews INT
>> )
>> PARTITIONED BY (event_date STRING)
>> STORED AS ORC  -- or PARQUET
>>
>> PARQUET takes 9 seconds of cumulative CPU.
>> ORC takes 50 seconds of cumulative CPU.
>>
>> For ORC I have tried
>> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
>> but it didn't change anything.
>>
>> Am I missing something, or is parquet better for this type of query?
>>
>> I am using spark 1.6.0 with hive 1.1.0.
>>
>> thanks
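Following up on the statistics error quoted above: `statistics` is not part of the command (which is why Hive reports "Table not found statistics"), and on a partitioned table the ANALYZE statements usually need a partition spec. A sketch, assuming the myTable layout from the thread (partitioned by event_date); exact syntax support may vary with the Hive version:

```sql
-- Table/partition-level stats (numRows, totalSize - the figures desc formatted shows):
ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS;

-- Column-level stats, kept in the metastore and used by the optimizer:
ANALYZE TABLE myTable PARTITION (event_date) COMPUTE STATISTICS FOR COLUMNS;

-- Then inspect the result with the corrected command:
-- desc formatted myTable;
```

With no `PARTITION (...)` clause, older Hive releases may refuse to analyze a partitioned table or analyze only what the current partition spec resolves to, so specifying the partition column is the safer habit.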