Hi,

I have 17,970,737 rows. I tried to do a "desc formatted statistics myTable" but I get:

    Error while compiling statement: FAILED: SemanticException [Error 10001]: Table not found statistics

even after doing something like "ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS".
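The error happens because Hive parses the first token after DESCRIBE FORMATTED as the table name, so "statistics" is looked up as a table. A minimal sketch of the intended commands, assuming Hive 1.1 syntax (table name `myTable` and column `event_date` are from the thread):

```sql
-- Compute table-level stats (numRows, rawDataSize, totalSize):
ANALYZE TABLE myTable COMPUTE STATISTICS;

-- Compute column-level stats as well:
ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS;

-- There is no "statistics" keyword in DESCRIBE; the table name comes
-- directly after FORMATTED. Table stats appear under "Table Parameters":
DESCRIBE FORMATTED myTable;

-- Column-level stats for one column:
DESCRIBE FORMATTED myTable event_date;
```

Note that on a partitioned table, ANALYZE without a PARTITION clause may need to be run per partition on older Hive versions.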
Thank you for your answer.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Have you analysed statistics on the ORC table? How many rows are there?

Also send the output of:

    desc formatted statistics <TABLE_NAME>

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,

I am executing one query:

    SELECT `event_date` AS `event_date`,
           sum(`bookings`) AS `bookings`,
           sum(`dealviews`) AS `dealviews`
    FROM myTable
    WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
    GROUP BY `event_date`
    LIMIT 20000

My table was created with something like:

    CREATE TABLE myTable (
      bookings DOUBLE,
      dealviews INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC  -- or PARQUET

Parquet takes 9 seconds of cumulative CPU; ORC takes 50 seconds of cumulative CPU.

For ORC I tried:

    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

but it didn't change anything. Am I missing something, or is Parquet better for this type of query?

I am using Spark 1.6.0 with Hive 1.1.0.

Thanks
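One thing worth checking in the original attempt: Spark SQL conf keys are case-sensitive, so setting "Spark.Sql.Orc.FilterPushdown" writes an unrelated key and silently does nothing; the key Spark 1.6 actually reads is spark.sql.orc.filterPushdown. A minimal PySpark 1.6 sketch of the setup described in the thread (app name is illustrative; table and column names come from the emails; this is environment-dependent and not a definitive benchmark harness):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-vs-parquet")  # app name is an assumption
hiveContext = HiveContext(sc)

# ORC predicate pushdown is off by default in Spark 1.6; the key is
# case-sensitive and must be all lowercase with camelCase "filterPushdown".
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

df = hiveContext.sql("""
    SELECT `event_date`,
           sum(`bookings`)  AS bookings,
           sum(`dealviews`) AS dealviews
    FROM myTable
    WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
    GROUP BY `event_date`
    LIMIT 20000
""")
df.show()
```

Also note that since event_date is the partition column here, the WHERE clause should be satisfied by partition pruning in both formats; ORC filter pushdown mainly helps when filtering on non-partition columns via per-stripe min/max statistics.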