Re: orc vs parquet aggregation, orc is really slow

Jörn Franke Sat, 16 Apr 2016 01:03:07 -0700

As a general recommendation (separate from this issue): do not store dates as 
strings. I recommend making them ints here; that will be much faster in both cases.
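
A minimal sketch of what that could look like, assuming the date is encoded as a
yyyyMMdd integer (so that numeric order matches calendar order). The table name
myTable_int is hypothetical and not part of the original setup:

```sql
-- Hypothetical variant of the table from the question: event_date is stored
-- as an INT in yyyyMMdd form instead of a STRING.
CREATE TABLE myTable_int (
  bookings  DOUBLE,
  dealviews INT
)
PARTITIONED BY (event_date INT)
STORED AS ORC;

-- The same aggregation, with integer bounds on the partition column.
SELECT event_date,
       SUM(bookings)  AS bookings,
       SUM(dealviews) AS dealviews
FROM myTable_int
WHERE event_date >= 20160106
  AND event_date <= 20160402
GROUP BY event_date
LIMIT 20000;
```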


It could be that you load the data differently into the two tables. Generally, 
for tables like these you should insert the data sorted in both cases.
It could also be that the file is compressed in one case and not in the other. 
It is always good practice to state all options in the CREATE TABLE statement, 
even the default ones.
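
As a rough illustration, both tables could be declared with an explicit codec and
loaded through the same sorted insert. The table names, the staging table
staging_events, and the choice of SNAPPY are assumptions made for this sketch;
whether the "parquet.compression" table property is honoured depends on the Hive
version and distribution, so check it against your setup:

```sql
-- Same codec spelled out for both formats, so the comparison is like for like.
CREATE TABLE myTable_orc (
  bookings  DOUBLE,
  dealviews INT
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

CREATE TABLE myTable_parquet (
  bookings  DOUBLE,
  dealviews INT
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");

-- Allow dynamic partitioning for the load below.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Load from a hypothetical staging table, grouping and sorting rows by the
-- partition column. Repeat the same statement for myTable_parquet.
INSERT OVERWRITE TABLE myTable_orc PARTITION (event_date)
SELECT bookings, dealviews, event_date
FROM staging_events
DISTRIBUTE BY event_date
SORT BY event_date;
```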

Your Hive version seems a little outdated. Are you using Spark as the execution 
engine? If so, you should upgrade to a newer version of Hive. The Spark execution 
engine on Hive is still somewhat more experimental than Tez. It also depends on 
which distribution you are using.

Normally I would expect both of them to perform similarly.

> On 16 Apr 2016, at 09:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
> 
> Hi,
> I am executing one query:
>
> SELECT `event_date` AS `event_date`,
>        SUM(`bookings`) AS `bookings`,
>        SUM(`dealviews`) AS `dealviews`
> FROM myTable
> WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02'
> GROUP BY `event_date`
> LIMIT 20000
> 
> My table was created something like :
>   
> CREATE TABLE myTable (
>   bookings   DOUBLE,
>   dealviews  INT
> )
> PARTITIONED BY (event_date STRING)
> STORED AS ORC;  -- or PARQUET for the second table
> 
> PARQUET takes 9 seconds of cumulative CPU time.
> ORC takes 50 seconds of cumulative CPU time.
> 
> For ORC I have tried:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> But it didn't change anything.
> 
> Am I missing something, or is Parquet better for this type of query?
>
> I am using Spark 1.6.0 with Hive 1.1.0.
> 
> thanks
> 
>
