Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Rajesh Balamohan
1. In the first case (i.e., the cluster where you have Hive and Spark), the
query would have been executed via HiveTableScan instead of OrcRelation.
HiveTableScan does not propagate any PPD-related information to the ORC
readers (SPARK-12998). PPD might not play a big role here, though, as your
WHERE conditions seem to be only on the event_date partition. Check whether
the job launch itself is what takes the time: in 1.6.x, HiveTableScan incurs a
thread dump for every reader, and SPARK-12898 might improve things if that is
the case. Also, SPARK-12925 can improve perf if you are using lots of String
columns.
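For example, a quick way to confirm which scan path your plan is taking (a
minimal PySpark sketch; the query text is taken from your mail):

    # Print the physical plan; look for HiveTableScan vs. OrcRelation in it
    q = ("SELECT `event_date`, sum(`bookings`) AS `bookings` "
         "FROM myTable WHERE `event_date` >= '2016-01-06' LIMIT 2")
    sqlContext.sql(q).explain(True)  # extended=True prints the full plan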

2. In the second case (i.e., the cluster where you have Spark alone), can you
check whether it is taking time to launch the job, or whether it is slow
overall? You can check the timings in the Spark UI and compare them with the
overall job runtime.
- SPARK-12925, as mentioned in the first case, can also impact perf if you
have lots of string columns.
- OrcRelation internally makes use of HadoopRDD. If lots of partitions get
created in your case, SPARK-14113 can have an impact, as closure cleaning is
called every time an RDD is created. This has the potential to slow down the
overall job runtime depending on the number of partitions.
- OrcRelation suffers from the callSite problem as well, so it incurs the
thread-dump call for every partition, which can hurt depending on the number
of partitions being created.
- Parquet caches a lot of meta information, so its round trips to the NameNode
(NN) are fairly low; when you run the query a second or third time, Parquet
queries tend to finish faster. OrcRelation has no such meta caching and incurs
the NN cost on every run (see the timing sketch below).
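A minimal way to see the warm-cache effect, assuming the table name and query
from this thread (a rough sketch, not a benchmark harness):

    import time

    def timed(label, q):
        start = time.time()
        sqlContext.sql(q).collect()
        print("%s: %.1fs" % (label, time.time() - start))

    q = ("SELECT `event_date`, sum(`bookings`) AS `bookings`, "
         "sum(`dealviews`) AS `dealviews` FROM myTable "
         "WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' "
         "GROUP BY `event_date` LIMIT 2")
    timed("first run (cold metadata)", q)
    timed("second run", q)  # Parquet benefits from cached metadata; ORC does not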

~Rajesh.B

On Mon, Apr 18, 2016 at 3:29 AM, Maurin Lenglart <mau...@cuberonlabs.com>
wrote:

>
> Let me explain my architecture a little:
> I have one cluster with Hive and Spark. That is where I create my databases,
> create the tables, and insert data into them.
> If I execute this query: self.sqlContext.sql("SELECT `event_date` as
> `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
> FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2") using ORC, it takes 15 seconds.
>
> But then I export the tables to another cluster where I don't have Hive, so
> I load my tables using
>
> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
> Then this query: self.sqlContext.sql("SELECT `event_date` as
> `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
> FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2") takes 50 seconds.
>
> If I do the same process using Parquet tables, self.sqlContext.sql("SELECT
> `event_date` as `event_date`, sum(`bookings`) as `bookings`,
> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT
> 2") takes 8 seconds.
>
>
> thanks
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Sunday, April 17, 2016 at 2:52 PM
>
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> hang on, so it takes 15 seconds to switch the database context with
> HiveContext.sql("use myDatabase")?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> The stats are only for one file in one partition. There are 17,970,737 rows
>> in total.
>> The table is not bucketed.
>>
>> The problem is not inserting rows; the problem is with this SQL query:
>>
>> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
>> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"
>>
>> Benchmarks:
>>
>>    - 8 seconds on a Parquet table loaded using
>>      sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
>>    - 50 seconds on ORC using
>>      sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>>    - 15 seconds on ORC using sqlContext.sql('use myDatabase')
>>
>> The use case that I have is the second and slowest benchmark. Is there
>> something I can do to speed that up?
>>
>> thanks

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
Ok, that may explain it. In the other cluster you register the data as a temp
table and then collect it with SQL running against that temp table, which
loads the data at that point; if you do not have enough memory for the temp
table, it will have to spill to disk and make many passes. Could that be a
plausible explanation, without knowing the details?
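If that is the cause, explicitly persisting the loaded DataFrame before
querying it should make repeated runs cheap. A minimal sketch, assuming the
path and table name used elsewhere in this thread:

    from pyspark import StorageLevel

    df = sqlContext.read.format("orc").load("mytableFiles")
    df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk, don't recompute
    df.registerTempTable("myTable")
    df.count()  # materialize the cache once, before timing the aggregation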

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:59, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

>
> Let me explain my architecture a little:
> I have one cluster with Hive and Spark. That is where I create my databases,
> create the tables, and insert data into them.
> If I execute this query: self.sqlContext.sql("SELECT `event_date` as
> `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
> FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2") using ORC, it takes 15 seconds.
>
> But then I export the tables to another cluster where I don't have Hive, so
> I load my tables using
>
> sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
> Then this query: self.sqlContext.sql("SELECT `event_date` as
> `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
> FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2") takes 50 seconds.
>
> If I do the same process using Parquet tables, self.sqlContext.sql("SELECT
> `event_date` as `event_date`, sum(`bookings`) as `bookings`,
> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT
> 2") takes 8 seconds.
>
>
> thanks
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Sunday, April 17, 2016 at 2:52 PM
>
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> hang on, so it takes 15 seconds to switch the database context with
> HiveContext.sql("use myDatabase")?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> The stats are only for one file in one partition. There are 17,970,737 rows
>> in total.
>> The table is not bucketed.
>>
>> The problem is not inserting rows; the problem is with this SQL query:
>>
>> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
>> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"
>>
>> Benchmarks:
>>
>>    - 8 seconds on a Parquet table loaded using
>>      sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
>>    - 50 seconds on ORC using
>>      sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>>    - 15 seconds on ORC using sqlContext.sql('use myDatabase')
>>
>> The use case that I have is the second and slowest benchmark. Is there
>> something I can do to speed that up?
>>
>> thanks
>>
>>
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Sunday, April 17, 2016 at 2:22 PM
>>
>> To: maurin lenglart <mau...@cuberonlabs.com>
>> Cc: "user @spark" <user@spark.apache.org>
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Hi Maurin,
>>
>> Have you tried creating your table in Hive as a Parquet table? This table
>> is pretty small, with 100K rows.
>>
>> Is the Hive table bucketed at all? I gather your issue at the moment is
>> that inserting rows into the Hive table is taking longer (compared to
>> Parquet)?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart

Let me explain my architecture a little:
I have one cluster with Hive and Spark. That is where I create my databases,
create the tables, and insert data into them.
If I execute this query: self.sqlContext.sql("SELECT `event_date` as
`event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <=
'2016-04-02' GROUP BY `event_date` LIMIT 2") using ORC, it takes 15 seconds.

But then I export the tables to another cluster where I don't have Hive, so I
load my tables using
sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
Then this query: self.sqlContext.sql("SELECT `event_date` as
`event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <=
'2016-04-02' GROUP BY `event_date` LIMIT 2") takes 50 seconds.

If I do the same process using Parquet tables, self.sqlContext.sql("SELECT
`event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`)
as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND
`event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2") takes 8 seconds.


thanks

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:52 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

hang on, so it takes 15 seconds to switch the database context with
HiveContext.sql("use myDatabase")?


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
The stats are only for one file in one partition. There are 17,970,737 rows in
total.
The table is not bucketed.

The problem is not inserting rows; the problem is with this SQL query:

"SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06'
AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"

Benchmarks:

  *   8 seconds on a Parquet table loaded using
      sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
  *   50 seconds on ORC using
      sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
  *   15 seconds on ORC using sqlContext.sql('use myDatabase')

The use case that I have is the second and slowest benchmark. Is there
something I can do to speed that up?

thanks



From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:22 PM

To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Hi Maurin,

Have you tried creating your table in Hive as a Parquet table? This table is
pretty small, with 100K rows.

Is the Hive table bucketed at all? I gather your issue at the moment is that
inserting rows into the Hive table is taking longer (compared to Parquet)?

HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
Hi,
I am using the Cloudera distribution, and when I do a "desc formatted table" I
don't get all the table parameters.

But I did a hive --orcfiledump on one random file (I replaced some of the
values that could be sensitive):
hive --orcfiledump
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The 
hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  
Continuing without it.
Structure for 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
File Version: 0.12 with HIVE_8732
16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1 with 
{include: null, offset: 0, length: 9223372036854775807}
Rows: 104260
Compression: ZLIB
Compression size: 262144
Type: struct

Stripe Statistics:
  Stripe 1:
Column 0: count: 104260 hasNull: false
Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
Column 3: count: 104260 h

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
hang on, so it takes 15 seconds to switch the database context with
HiveContext.sql("use myDatabase")?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:44, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

> The stats are only for one file in one partition. There are 17,970,737 rows
> in total.
> The table is not bucketed.
>
> The problem is not inserting rows; the problem is with this SQL query:
>
> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"
>
> Benchmarks:
>
>    - 8 seconds on a Parquet table loaded using
>      sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
>    - 50 seconds on ORC using
>      sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
>    - 15 seconds on ORC using sqlContext.sql('use myDatabase')
>
> The use case that I have is the second and slowest benchmark. Is there
> something I can do to speed that up?
>
> thanks
>
>
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Sunday, April 17, 2016 at 2:22 PM
>
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Hi Maurin,
>
> Have you tried creating your table in Hive as a Parquet table? This table
> is pretty small, with 100K rows.
>
> Is the Hive table bucketed at all? I gather your issue at the moment is
> that inserting rows into the Hive table is taking longer (compared to
> Parquet)?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I am using the Cloudera distribution, and when I do a "desc formatted
>> table" I don't get all the table parameters.
>>
>> But I did a hive --orcfiledump on one random file (I replaced some of the
>> values that could be sensitive):
>> hive --orcfiledump
>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
>> 2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The
>> hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
>> Continuing without it.
>> Structure for /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
>> File Version: 0.12 with HIVE_8732
>> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
>> with {include: null, offset: 0, length: 9223372036854775807}
>> Rows: 104260
>> Compression: ZLIB
>> Compression size: 262144
>> Type: struct<...>
>>
>> Stripe Statistics:
>>   Stripe 1:
>> Column 0: count: 104260 hasNull: false
>> Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>> Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>> Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
>> Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>> Column 5: count: 104260 hasNull: false min: XXX max: Others sum:
>> 262478
>> Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
>> Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
>> Column 8: count: 104260 hasNull: false min:  max: XXX sum: 1657073
>> Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum:
>> 730846
>> Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum:
>> 5481629
>> Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>> Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum:
>> 324522.0
>> Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
>> 6958348.122699987
>> Column 14: count: 104260 hasNull: false min: -2273.0 max:
>> 39930.13977860418 sum: 1546639.6964531767
>> Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>> Column 16: count: 104260 

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
The stats are only for one file in one partition. There are 17,970,737 rows in
total.
The table is not bucketed.

The problem is not inserting rows; the problem is with this SQL query:

"SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06'
AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"

Benchmarks:

  *   8 seconds on a Parquet table loaded using
      sqlContext.read.format('parquet').load('mytableFiles').registerAsTable('myTable')
  *   50 seconds on ORC using
      sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
  *   15 seconds on ORC using sqlContext.sql('use myDatabase')

The use case that I have is the second and slowest benchmark. Is there
something I can do to speed that up?

thanks



From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:22 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Hi Maurin,

Have you tried creating your table in Hive as a Parquet table? This table is
pretty small, with 100K rows.

Is the Hive table bucketed at all? I gather your issue at the moment is that
inserting rows into the Hive table is taking longer (compared to Parquet)?

HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 21:43, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
Hi,
I am using the Cloudera distribution, and when I do a "desc formatted table" I
don't get all the table parameters.

But I did a hive --orcfiledump on one random file (I replaced some of the
values that could be sensitive):
hive --orcfiledump
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The 
hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  
Continuing without it.
Structure for 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
File Version: 0.12 with HIVE_8732
16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1 with 
{include: null, offset: 0, length: 9223372036854775807}
Rows: 104260
Compression: ZLIB
Compression size: 262144
Type: struct

Stripe Statistics:
  Stripe 1:
Column 0: count: 104260 hasNull: false
Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
Column 8: count: 104260 hasNull: false min:  max: XXX sum: 1657073
Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 
6958348.122699987
Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 
sum: 1546639.6964531767
Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 
sum: 45711.881143417035
Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 
sum: 1042600

File Statistics:
  Column 0: count: 104260 hasNull: false
  Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
  Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
  Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
  Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
  Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
  Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
  Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
  Column 8: count: 104260 hasNull: false min:  max: Travel sum: 1657073
  Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
  Column 10: count: 104260 hasNull: false mi

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
 take at least twice the amount of time
> compared to queries done on the table loaded with Hive directly.
> For technical reasons my pipeline is not fully migrated to use Hive
> tables, and in a lot of places I still manually load the table and register
> it.
> I only see this problem with the ORC format.
> Do you see a solution so I could get the same performance on loaded
> tables?
>
> thanks
>
>
>
>
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Saturday, April 16, 2016 at 4:14 AM
> To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <
> user@spark.apache.org>
>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Apologies, that should read:
>
> desc formatted <table_name>
>
> Example for table dummy
>
> hive> desc formatted dummy;
> OK
> # col_name              data_type            comment
> id                      int
> clustered               int
> scattered               int
> randomised              int
> random_string           varchar(50)
> small_vc                varchar(10)
> padding                 varchar(10)
>
> # Detailed Table Information
> Database:               test
> Owner:                  hduser
> CreateTime:             Sun Jan 31 06:09:56 GMT 2016
> LastAccessTime:         UNKNOWN
> Retention:              0
> Location:               hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
> Table Type:             MANAGED_TABLE
> Table Parameters:
>         COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
>         numFiles                35
>         numRows                 1
>         orc.bloom.filter.columns        ID
>         orc.bloom.filter.fpp    0.05
>         orc.compress            SNAPPY
>         orc.create.index        true
>         orc.row.index.stride    1
>         orc.stripe.size         16777216
>         rawDataSize             0
>         totalSize               5660813776
>         transient_lastDdlTime   1458636871
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:             No
> Num Buckets:            256
> Bucket Columns:         [id]
> Sort Columns:           []
> Storage Desc Params:
>         serialization.format    1
> Time taken: 0.065 seconds, Fetched: 42 row(s)
>
> Store dates as dates.
>
> To convert a string to a date for a column like paymentdate (note: below, a
> UK date format is converted to Hive yyyy-MM-dd), use the following:
>
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
> AS paymentdate
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I have 17,970,737 rows.
>> I tried to do a "desc formatted statistics myTable" but I get "Error
>> while compiling statement: FAILED: SemanticException [Error 10001]: Table
>> not found statistics".
>> Even after doing something like: "ANALYZE TABLE myTable COMPUTE
>> STATISTICS FOR COLUMNS"
>>
>> Thank you for your answer.
>>
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date: Saturday, April 16, 2016 at 12:32 AM
>> To: maurin lenglart <mau...@cuberonlabs.com>
>> Cc: "user @spark" <user@spark.apache.org>
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Have you analysed statistics on the ORC table? How many rows are there?
>>
>> Also send the output of
>>
>> desc formatted statistics <table_name>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com>
>> wrote:
>>
>>> Hi,
>>> I am executing one query:
>>> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
>>> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
>>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>>> LIMIT 2"
>>>
>>> My table was created something like:
>>>   CREATE TABLE myTable (
>>>     bookings    DOUBLE
>>>     , dealviews INT
>>>   )
>>>   STORED AS ORC or PARQUET
>>>   PARTITIONED BY (event_date STRING)
>>>
>>> Parquet takes 9 seconds of cumulative CPU;
>>> ORC takes 50 seconds of cumulative CPU.
>>>
>>> For ORC I have tried
>>> hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
>>> but it didn't change anything.
>>>
>>> Am I missing something, or is Parquet better for this type of query?
>>>
>>> I am using Spark 1.6.0 with Hive 1.1.0.
>>>
>>> thanks
>>>
>>>
>>>
>>
>


Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
start: 2674 length 191
Stream: column 16 section ROW_INDEX start: 2865 length 295
Stream: column 17 section ROW_INDEX start: 3160 length 194
Stream: column 18 section ROW_INDEX start: 3354 length 192
Stream: column 19 section ROW_INDEX start: 3546 length 122
Stream: column 1 section DATA start: 3668 length 33586
Stream: column 1 section LENGTH start: 37254 length 7
Stream: column 1 section DICTIONARY_DATA start: 37261 length 14
Stream: column 2 section DATA start: 37275 length 40616
Stream: column 2 section LENGTH start: 77891 length 8
Stream: column 2 section DICTIONARY_DATA start: 77899 length 32
Stream: column 3 section DATA start: 77931 length 46120
Stream: column 3 section LENGTH start: 124051 length 17
Stream: column 3 section DICTIONARY_DATA start: 124068 length 99
Stream: column 4 section DATA start: 124167 length 26498
Stream: column 5 section DATA start: 150665 length 38409
Stream: column 5 section LENGTH start: 189074 length 8
Stream: column 5 section DICTIONARY_DATA start: 189082 length 30
Stream: column 6 section DATA start: 189112 length 9425
Stream: column 6 section LENGTH start: 198537 length 9
Stream: column 6 section DICTIONARY_DATA start: 198546 length 36
Stream: column 7 section DATA start: 198582 length 95465
Stream: column 7 section LENGTH start: 294047 length 127
Stream: column 7 section DICTIONARY_DATA start: 294174 length 1130
Stream: column 8 section DATA start: 295304 length 43896
Stream: column 8 section LENGTH start: 339200 length 16
Stream: column 8 section DICTIONARY_DATA start: 339216 length 107
Stream: column 9 section DATA start: 339323 length 42274
Stream: column 9 section LENGTH start: 381597 length 9
Stream: column 9 section DICTIONARY_DATA start: 381606 length 55
Stream: column 10 section DATA start: 381661 length 89206
Stream: column 11 section DATA start: 470867 length 46613
Stream: column 12 section DATA start: 517480 length 62630
Stream: column 13 section DATA start: 580110 length 103241
Stream: column 14 section DATA start: 683351 length 138479
Stream: column 15 section DATA start: 821830 length 34983
Stream: column 16 section DATA start: 856813 length 7917
Stream: column 17 section DATA start: 864730 length 47435
Stream: column 18 section DATA start: 912165 length 462
Stream: column 19 section DATA start: 912627 length 140
Stream: column 19 section LENGTH start: 912767 length 6
Stream: column 19 section DICTIONARY_DATA start: 912773 length 13
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[3]
Encoding column 2: DICTIONARY_V2[5]
Encoding column 3: DICTIONARY_V2[12]
Encoding column 4: DIRECT_V2
Encoding column 5: DICTIONARY_V2[6]
Encoding column 6: DICTIONARY_V2[8]
Encoding column 7: DICTIONARY_V2[175]
Encoding column 8: DICTIONARY_V2[11]
Encoding column 9: DICTIONARY_V2[7]
Encoding column 10: DIRECT_V2
Encoding column 11: DIRECT_V2
Encoding column 12: DIRECT
Encoding column 13: DIRECT
Encoding column 14: DIRECT
Encoding column 15: DIRECT_V2
Encoding column 16: DIRECT
Encoding column 17: DIRECT_V2
Encoding column 18: DIRECT_V2
Encoding column 19: DICTIONARY_V2[1]

File length: 914171 bytes
Padding length: 0 bytes
Padding ratio: 0%


I also noticed something:
If I load a table like this:
sqlContext.read.format('orc').load('mytableFiles').registerAsTable('myTable')
the queries done on myTable take at least twice the amount of time compared to
queries done on the table loaded with Hive directly.
For technical reasons my pipeline is not fully migrated to use Hive tables,
and in a lot of places I still manually load the table and register it.
I only see this problem with the ORC format.
Do you see a solution so I could get the same performance on loaded tables?
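One workaround worth trying (a hypothetical sketch, not something confirmed in
this thread): materialize the exported files as a metastore-backed table once,
so that later queries take the same path as tables created in Hive. This needs
a HiveContext; an embedded local metastore is enough if no Hive service exists
on that cluster.

    df = sqlContext.read.format("orc").load("mytableFiles")
    df.write.format("orc").saveAsTable("myTable")  # one-time materialization

Whether this helps depends on whether the slowness really comes from the
OrcRelation path rather than from the data itself.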

thanks





From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 4:14 AM
To: maurin lenglart <mau...@cuberonlabs.com>,
"user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Apologies, that should read:

desc formatted <table_name>

Example for table dummy

hive> desc formatted dummy;
OK
# col_name              data_type            comment
id                      int
clustered               int
scattered               int
randomised              int
random_string           varchar(50)
small_vc                varchar(10)
padding                 varchar(10)

# Detailed Table Information
Database:               test
Owner:                  hduser
CreateTime:             Sun Jan 31 06:09:56 GMT 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Mich Talebzadeh
Apologies, that should read:

desc formatted <table_name>

Example for table dummy

hive> desc formatted dummy;
OK
# col_name              data_type            comment
id                      int
clustered               int
scattered               int
randomised              int
random_string           varchar(50)
small_vc                varchar(10)
padding                 varchar(10)

# Detailed Table Information
Database:               test
Owner:                  hduser
CreateTime:             Sun Jan 31 06:09:56 GMT 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                35
        numRows                 1
        orc.bloom.filter.columns        ID
        orc.bloom.filter.fpp    0.05
        orc.compress            SNAPPY
        orc.create.index        true
        orc.row.index.stride    1
        orc.stripe.size         16777216
        rawDataSize             0
        totalSize               5660813776
        transient_lastDdlTime   1458636871

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [id]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.065 seconds, Fetched: 42 row(s)

Store dates as dates.

To convert a string to a date for a column like paymentdate (note: below, a UK
date format is converted to Hive yyyy-MM-dd), use the following:

TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
AS paymentdate
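For completeness, a rough PySpark equivalent of the same conversion (the
column name paymentdate is assumed, as above):

    from pyspark.sql import functions as F

    df = df.withColumn(
        "paymentdate",
        F.to_date(F.from_unixtime(
            F.unix_timestamp(df["paymentdate"], "dd/MM/yyyy"))))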
HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 April 2016 at 09:09, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

> Hi,
> I have 17,970,737 rows.
> I tried to do a "desc formatted statistics myTable" but I get "Error
> while compiling statement: FAILED: SemanticException [Error 10001]: Table
> not found statistics".
> Even after doing something like: "ANALYZE TABLE myTable COMPUTE
> STATISTICS FOR COLUMNS"
>
> Thank you for your answer.
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Saturday, April 16, 2016 at 12:32 AM
> To: maurin lenglart <mau...@cuberonlabs.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Have you analysed statistics on the ORC table? How many rows are there?
>
> Also send the output of
>
> desc formatted statistics <table_name>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>
>> Hi,
>> I am executing one query:
>> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
>> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
>> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>> LIMIT 2"
>>
>> My table was created something like:
>>   CREATE TABLE myTable (
>>     bookings    DOUBLE
>>     , dealviews INT
>>   )
>>   STORED AS ORC or PARQUET
>>   PARTITIONED BY (event_date STRING)
>>
>> Parquet takes 9 seconds of cumulative CPU;
>> ORC takes 50 seconds of cumulative CPU.
>>
>> For ORC I have tried
>> hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
>> but it didn't change anything.
>>
>> Am I missing something, or is Parquet better for this type of query?
>>
>> I am using Spark 1.6.0 with Hive 1.1.0.
>>
>> thanks
>>
>>
>>
>


Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi,

I will put the date in the correct format in the future and see if that
changes anything.
The query that I sent is just an example of one possible aggregation; I have a
lot of them possible on the same table, so I am not sure that sorting for all
of them would actually have an impact.

I am using the latest Cloudera release and I didn't modify any versions. Do
you think that I should try to manually update Hive?

thanks

From: Jörn Franke <jornfra...@gmail.com>
Date: Saturday, April 16, 2016 at 1:02 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow


Generally, a recommendation (besides the issue): do not store dates as String.
I recommend making them ints here; it will be much faster in both cases.

It could be that you load them differently into the two tables. Generally, for
these tables you should insert the data sorted, in both cases.
It could also be that in one case you compress the file and in the other you
do not. It is always good practice to have all options in the create table
statement - even the default ones.

Your Hive seems a little bit outdated. Do you use Spark as an execution
engine? Then you should upgrade to a newer version of Hive. The Spark
execution engine on Hive is still a little more experimental than Tez. It also
depends which distribution you are using.

Normally I would expect both of them to perform similarly.

On 16 Apr 2016, at 09:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:

Hi,
I am executing one query:
"SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06'
AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"

My table was created something like:
  CREATE TABLE myTable (
    bookings    DOUBLE
    , dealviews INT
  )
  STORED AS ORC or PARQUET
  PARTITIONED BY (event_date STRING)

Parquet takes 9 seconds of cumulative CPU;
ORC takes 50 seconds of cumulative CPU.

For ORC I have tried
hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
but it didn't change anything.

Am I missing something, or is Parquet better for this type of query?

I am using Spark 1.6.0 with Hive 1.1.0.

thanks




Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi,
I have 17,970,737 rows.
I tried to do a "desc formatted statistics myTable" but I get "Error while
compiling statement: FAILED: SemanticException [Error 10001]: Table not found
statistics".
Even after doing something like: "ANALYZE TABLE myTable COMPUTE STATISTICS FOR
COLUMNS"
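On a partitioned table, the column-stats statement may also need a partition
spec, and basic stats are usually computed before column stats. A hedged
sketch, assuming a HiveContext that delegates these statements to Hive
(support varies by Spark/Hive version):

    sqlContext.sql("ANALYZE TABLE myTable PARTITION(event_date) "
                   "COMPUTE STATISTICS")
    sqlContext.sql("ANALYZE TABLE myTable PARTITION(event_date) "
                   "COMPUTE STATISTICS FOR COLUMNS")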

Thank you for your answer.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow

Have you analysed statistics on the ORC table? How many rows are there?

Also send the output of

desc formatted statistics <table_name>

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 April 2016 at 08:20, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
Hi,
I am executing one query:
"SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06'
AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"

My table was created something like:
  CREATE TABLE myTable (
    bookings    DOUBLE
    , dealviews INT
  )
  STORED AS ORC or PARQUET
  PARTITIONED BY (event_date STRING)

Parquet takes 9 seconds of cumulative CPU;
ORC takes 50 seconds of cumulative CPU.

For ORC I have tried
hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
but it didn't change anything.

Am I missing something, or is Parquet better for this type of query?

I am using Spark 1.6.0 with Hive 1.1.0.

thanks





Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Jörn Franke

Generally, a recommendation (besides the issue): do not store dates as String.
I recommend making them ints here; it will be much faster in both cases.
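For instance, a hypothetical PySpark sketch that derives an int column like
20160106 from the yyyy-MM-dd string, so that range predicates compare integers
instead of strings (the column name event_date is taken from the thread):

    from pyspark.sql import functions as F

    df = df.withColumn(
        "event_date_int",
        F.regexp_replace("event_date", "-", "").cast("int"))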

It could be that you load them differently into the two tables. Generally, for
these tables you should insert the data sorted, in both cases.
It could also be that in one case you compress the file and in the other you
do not. It is always good practice to have all options in the create table
statement - even the default ones.

Your Hive seems a little bit outdated. Do you use Spark as an execution
engine? Then you should upgrade to a newer version of Hive. The Spark
execution engine on Hive is still a little more experimental than Tez. It also
depends which distribution you are using.

Normally I would expect both of them to perform similarly.

> On 16 Apr 2016, at 09:20, Maurin Lenglart  wrote:
> 
> Hi,
> I am executing one query : 
> “SELECT `event_date` as `event_date`,sum(`bookings`) as 
> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date` 
> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 
> 2”
> 
> My table was created something like :
>   
> CREATE TABLE myTable (
>   bookingsDOUBLE
>   , deal views  INT
>   )
>STORED AS ORC or PARQUET
>  PARTITION BY (event_date STRING)
> 
> PARQUET take 9second of cumulative CPU
> ORC take 50second of cumulative CPU. 
> 
> For ORC I have tried to 
> hiveContext.setConf(“Spark.Sql.Orc.FilterPushdown”,“true”)
> But it didn’t change anything
> 
> I am missing something, or parquet is better for this type of query?
> 
> I am using spark 1.6.0 with hive 1.1.0
> 
> thanks
> 
> 


Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Mich Talebzadeh
Have you analysed statistics on the ORC table? How many rows are there?

Also send the output of

desc formatted statistics <table_name>

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 April 2016 at 08:20, Maurin Lenglart  wrote:

> Hi,
> I am executing one query:
> "SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
> sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
> '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
> LIMIT 2"
>
> My table was created something like:
>   CREATE TABLE myTable (
>     bookings    DOUBLE
>     , dealviews INT
>   )
>   STORED AS ORC or PARQUET
>   PARTITIONED BY (event_date STRING)
>
> Parquet takes 9 seconds of cumulative CPU;
> ORC takes 50 seconds of cumulative CPU.
>
> For ORC I have tried
> hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
> but it didn't change anything.
>
> Am I missing something, or is Parquet better for this type of query?
>
> I am using Spark 1.6.0 with Hive 1.1.0.
>
> thanks
>
>
>


orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi,
I am executing one query:
"SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`,
sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06'
AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2"

My table was created something like:
  CREATE TABLE myTable (
    bookings    DOUBLE
    , dealviews INT
  )
  STORED AS ORC or PARQUET
  PARTITIONED BY (event_date STRING)

Parquet takes 9 seconds of cumulative CPU;
ORC takes 50 seconds of cumulative CPU.

For ORC I have tried
hiveContext.setConf("Spark.Sql.Orc.FilterPushdown", "true")
but it didn't change anything.
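One thing worth double-checking, since Spark configuration keys are
case-sensitive: the canonical form of that setting is
spark.sql.orc.filterPushdown (a one-line sketch):

    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")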

Am I missing something, or is Parquet better for this type of query?

I am using Spark 1.6.0 with Hive 1.1.0.

thanks