Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra

Hi Zhan Zhang,

Could my problem (no ORC predicate is generated from the WHERE clause 
even though spark.sql.orc.filterPushdown=true) be related to any of the 
factors below?


- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy

BR,
Patcharee


On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available on this 
part. Can you please open a JIRA to add some debug information on the 
driver side?


Thanks.

Zhan Zhang

On Oct 9, 2015, at 10:22 AM, patcharee wrote:


I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), 
but the log still shows "No ORC pushdown predicate" for my query with 
a WHERE clause:

15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I do not understand what is wrong with this.

BR,
Patcharee

On 9 Oct 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate is 
right based on your setting: leaf-0 = (EQUALS x 320)

The right way is to enable the predicate pushdown as follows:
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

Thanks.

Zhan Zhang







On Oct 9, 2015, at 9:58 AM, patcharee  wrote:


Hi Zhan Zhang

Actually, my query has a WHERE clause: "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z 
from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and 
z >= 2 and z <= 8". Columns "x" and "y" are not partition columns; the 
others are partition columns. I expected the system to use predicate 
pushdown, but when I turned on debug logging I found that no pushdown 
predicate was generated ("DEBUG OrcInputFormat: No ORC pushdown 
predicate").


Then I tried to set the search argument explicitly (on the column 
"x", which is not a partition column):

   val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build()

   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

This time the log showed that a pushdown predicate was generated, but 
the results were wrong (no results at all):


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320)
expr = leaf-0

Any ideas what is wrong with this? Why is the ORC pushdown predicate 
not applied by the system?


BR,
Patcharee

On 9 Oct 2015 18:31, Zhan Zhang wrote:

Hi Patcharee,

From the query, it looks like only column pruning will be applied. 
Partition pruning and predicate pushdown do not have any effect. Do you 
see a big I/O difference between the two methods?


The potential reason for the speed difference that I can think of is 
the different versions of OrcInputFormat. The Hive path may use 
NewOrcInputFormat, but the Spark path uses OrcInputFormat.


Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee wrote:


Yes, the predicate pushdown is enabled, but it still takes longer 
than the first method.


BR,
Patcharee

On 8 Oct 2015 18:43, Zhan Zhang wrote:

Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee wrote:



Hi,

I am using Spark SQL 1.5 to query a Hive table stored as 
partitioned ORC files. There are about 6000 files in total, 
and each file is about 245 MB.


What is the difference between the two query methods below?

1. Querying the Hive table directly:

hiveContext.sql("select col1, col2 from table1")

2. Reading from the ORC files, registering a temp table, and querying 
the temp table:

val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (querying all 6000 files), 
the second method is much slower than the first. Any ideas why?


BR,




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 

For additional commands, e-mail: 
user-h...@spark.apache.org

Re: sql query orc slow

2015-10-13 Thread Zhan Zhang
Hi Patcharee,

I am not sure which side is wrong, the driver or the executor. If it is the 
executor side, the factors you mentioned are possible causes. But if the driver 
side didn’t set the predicate at all, then something else is broken.

Can you please file a JIRA with simple reproduction steps, and let me know the 
JIRA number?

Thanks.

Zhan Zhang

Re: sql query orc slow

2015-10-13 Thread Patcharee Thongtra

Hi Zhan Zhang,

Here is the issue https://issues.apache.org/jira/browse/SPARK-11087

BR,
Patcharee

Re: sql query orc slow

2015-10-09 Thread patcharee
Yes, the predicate pushdown is enabled, but it still takes longer than 
the first method.


BR,
Patcharee

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
Hi Patcharee,

From the query, it looks like only column pruning will be applied. 
Partition pruning and predicate pushdown do not have any effect. Do you see a 
big I/O difference between the two methods?

The potential reason for the speed difference that I can think of is the 
different versions of OrcInputFormat. The Hive path may use NewOrcInputFormat, 
but the Spark path uses OrcInputFormat.

Thanks.

Zhan Zhang
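
For readers unfamiliar with the term, predicate pushdown on ORC roughly means using per-stripe min/max statistics to skip data that cannot match a filter. The sketch below is a simplified model of that idea only; `StripeStats`, `mayContain` and `stripesToRead` are hypothetical names made up for illustration and are not part of the ORC, Hive or Spark APIs.

```scala
// Hypothetical model of ORC-style min/max statistics for one column "x".
// This is NOT the real ORC reader API, only an illustration of the idea.
final case class StripeStats(id: Int, minX: Long, maxX: Long)

object PushdownSketch {
  // A stripe may contain rows with x = value only if value is inside [minX, maxX].
  def mayContain(s: StripeStats, value: Long): Boolean =
    value >= s.minX && value <= s.maxX

  // Ids of the stripes that still have to be read for the filter x = value.
  def stripesToRead(stripes: Seq[StripeStats], value: Long): Seq[Int] =
    stripes.filter(mayContain(_, value)).map(_.id)

  def main(args: Array[String]): Unit = {
    val stripes = Seq(
      StripeStats(0, 0L, 199L),
      StripeStats(1, 200L, 399L),
      StripeStats(2, 400L, 599L)
    )
    // Only stripe 1 has a range that overlaps x = 320.
    println(stripesToRead(stripes, 320L))
  }
}
```

A query without a WHERE clause, such as the `select col1, col2` example in this thread, gives a reader nothing to compare against these statistics, which is consistent with the remark above that only column pruning applies. When the data is not sorted on the filtered column, the stripe ranges tend to overlap and fewer stripes can be skipped.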

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
In your case, you manually set an AND pushdown, and the predicate is right 
based on your setting: leaf-0 = (EQUALS x 320)

The right way is to enable the predicate pushdown as follows:
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

Thanks.

Zhan Zhang
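
A fuller version of that setting in context might look like the following. This is a minimal sketch assuming Spark 1.5 with Hive support on the classpath; the application name is made up, the path is the one from the first post, and the filter columns `x` and `y` come from the query quoted in this thread, so they are assumptions about the data rather than something verified here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OrcPushdownExample {
  def main(args: Array[String]): Unit = {
    // Application name is arbitrary.
    val sc = new SparkContext(new SparkConf().setAppName("orc-pushdown-example"))
    val hiveContext = new HiveContext(sc)

    // Enable ORC predicate pushdown before the query is planned.
    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

    // Path taken from the thread; adjust it to the real table location.
    val df = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")

    // Column names are assumptions based on the query quoted in the thread.
    val filtered = df.filter("x = 320 and y = 117")

    // Print the logical and physical plans so the filter placement can be inspected.
    filtered.explain(true)

    sc.stop()
  }
}
```

Whether the predicate actually reaches the ORC reader still has to be confirmed from the OrcInputFormat log lines, as discussed in the rest of the thread.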







Re: sql query orc slow

2015-10-09 Thread patcharee
I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), but 
the log still shows "No ORC pushdown predicate" for my query with a 
WHERE clause:

15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I do not understand what is wrong with this.

BR,
Patcharee
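
When many tasks are involved, checking the logs by eye gets tedious. The helper below is a small sketch that classifies log lines using the two OrcInputFormat messages quoted in this thread; the object and method names are hypothetical, and the marker strings are copied from those two log lines rather than from the Hive source.

```scala
object PushdownLogCheck {
  // Marker strings copied from the two OrcInputFormat log lines quoted in this thread.
  private val NoPredicate = "No ORC pushdown predicate"
  private val HasPredicate = "ORC pushdown predicate:"

  // Some(true) if a predicate was logged, Some(false) if its absence was logged,
  // None if the line is unrelated.
  def classify(line: String): Option[Boolean] =
    if (line.contains(NoPredicate)) Some(false)
    else if (line.contains(HasPredicate)) Some(true)
    else None

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate",
      "15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320)"
    )
    lines.foreach(l => println(s"${classify(l)} <- $l"))
  }
}
```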

Re: sql query orc slow

2015-10-09 Thread patcharee

Hi Zhan Zhang

Actually, my query has a WHERE clause: "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D 
where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and 
z <= 8". Columns "x" and "y" are not partition columns; the others are 
partition columns. I expected the system to use predicate pushdown, but 
when I turned on debug logging I found that no pushdown predicate was 
generated ("DEBUG OrcInputFormat: No ORC pushdown predicate").


Then I tried to set the search argument explicitly (on the column "x", 
which is not a partition column):


val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build()

hiveContext.setConf("hive.io.file.readcolumn.names", "x")
hiveContext.setConf("sarg.pushdown", xs.toKryo())

This time the log showed that a pushdown predicate was generated, but 
the results were wrong (no results at all):


15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320)
expr = leaf-0

Any ideas what is wrong with this? Why is the ORC pushdown predicate not 
applied by the system?


BR,
Patcharee
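
The distinction between partition columns and ordinary data columns matters here because the two kinds of filter are handled differently: filters on partition columns can be answered by choosing partition directories, while only filters on data columns are candidates for an ORC pushdown predicate. The sketch below separates the conjuncts of the query above along that line. `Conjunct` and `PredicateSplit` are hypothetical names for illustration, and the partition column set is an assumption taken from the description in this message.

```scala
// Hypothetical, simplified representation of one conjunct of a WHERE clause.
final case class Conjunct(column: String, op: String, value: String)

object PredicateSplit {
  // Split conjuncts into (filters on partition columns, filters on data columns).
  def split(conjuncts: Seq[Conjunct],
            partitionCols: Set[String]): (Seq[Conjunct], Seq[Conjunct]) =
    conjuncts.partition(c => partitionCols.contains(c.column))

  def main(args: Array[String]): Unit = {
    // Conjuncts of the WHERE clause quoted in this message.
    val where = Seq(
      Conjunct("x", "=", "320"),
      Conjunct("y", "=", "117"),
      Conjunct("zone", "=", "2"),
      Conjunct("year", "=", "2009"),
      Conjunct("z", ">=", "2"),
      Conjunct("z", "<=", "8")
    )
    // Assumption from the message: every filtered column except x and y is a partition column.
    val partitionCols = Set("zone", "year", "z")

    val (partitionFilters, dataFilters) = split(where, partitionCols)
    println(s"partition pruning candidates: ${partitionFilters.map(_.column).distinct}")
    println(s"ORC pushdown candidates: ${dataFilters.map(_.column).distinct}")
  }
}
```

Under that assumption, only the conditions on `x` and `y` would be expected to show up in an ORC pushdown predicate.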

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
That is weird. Unfortunately, there is no debug info available on this part. 
Can you please open a JIRA to add some debug information on the driver side?

Thanks.

Zhan Zhang

Re: sql query orc slow

2015-10-08 Thread Zhan Zhang
Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang
