Hi,

I have 17,970,737 rows. I tried to do a "desc formatted statistics myTable" but I get:

    Error while compiling statement: FAILED: SemanticException [Error 10001]: Table not found statistics

even after doing something like "ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS".
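The error happens because Hive parses the first token after DESCRIBE FORMATTED as the table name, so "statistics" is looked up as a table. A minimal sketch of the intended commands, assuming Hive 1.1 syntax (table name `myTable` and column `event_date` are from the thread):

```sql
-- Compute table-level stats (numRows, rawDataSize, totalSize):
ANALYZE TABLE myTable COMPUTE STATISTICS;

-- Compute column-level stats as well:
ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS;

-- There is no "statistics" keyword in DESCRIBE; the table name comes
-- directly after FORMATTED. Table stats appear under "Table Parameters":
DESCRIBE FORMATTED myTable;

-- Column-level stats for one column:
DESCRIBE FORMATTED myTable event_date;
```

Note that on a partitioned table, ANALYZE without a PARTITION clause may need to be run per partition on older Hive versions.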
Thank you for your answer.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Have you analysed statistics on the ORC table? How many rows are there?

Also send the output of:

    desc formatted statistics <TABLE_NAME>

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I am executing one query:

    SELECT `event_date` AS `event_date`,
           sum(`bookings`) AS `bookings`,
           sum(`dealviews`) AS `dealviews`
    FROM myTable
    WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
    GROUP BY `event_date`
    LIMIT 20000

My table was created with something like:

    CREATE TABLE myTable (
      bookings DOUBLE,
      dealviews INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC  -- or PARQUET

Parquet takes 9 seconds of cumulative CPU; ORC takes 50 seconds of cumulative CPU.

For ORC I tried:

    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

but it didn't change anything. Am I missing something, or is Parquet better for this type of query?

I am using Spark 1.6.0 with Hive 1.1.0.

Thanks
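One thing worth checking in the original attempt: Spark SQL conf keys are case-sensitive, so setting "Spark.Sql.Orc.FilterPushdown" writes an unrelated key and silently does nothing; the key Spark 1.6 actually reads is spark.sql.orc.filterPushdown. A minimal PySpark 1.6 sketch of the setup described in the thread (app name is illustrative; table and column names come from the emails; this is environment-dependent and not a definitive benchmark harness):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-vs-parquet")  # app name is an assumption
hiveContext = HiveContext(sc)

# ORC predicate pushdown is off by default in Spark 1.6; the key is
# case-sensitive and must be all lowercase with camelCase "filterPushdown".
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

df = hiveContext.sql("""
    SELECT `event_date`,
           sum(`bookings`)  AS bookings,
           sum(`dealviews`) AS dealviews
    FROM myTable
    WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
    GROUP BY `event_date`
    LIMIT 20000
""")
df.show()
```

Also note that since event_date is the partition column here, the WHERE clause should be satisfied by partition pruning in both formats; ORC filter pushdown mainly helps when filtering on non-partition columns via per-stripe min/max statistics.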