Hi Patcharee,

I am not sure which side is wrong, the driver or the executor. If it is the executor 
side, the reason you mentioned may be possible. But if the driver side did not set 
the predicate at all, then something else is broken.

Can you please file a JIRA with simple steps to reproduce, and let me know the 
JIRA number?

Thanks.

Zhan Zhang

On Oct 13, 2015, at 1:01 AM, Patcharee Thongtra 
<patcharee.thong...@uni.no> wrote:

Hi Zhan Zhang,

Could my problem (the ORC predicate is not generated from the WHERE clause even 
though spark.sql.orc.filterPushdown=true) be related to any of the factors below?

- orc file version (File Version: 0.12 with HIVE_8732)
- hive version (using Hive 1.2.1.2.3.0.0-2557)
- orc table is not sorted / indexed
- the split strategy hive.exec.orc.split.strategy

BR,
Patcharee


On 10/09/2015 08:01 PM, Zhan Zhang wrote:
That is weird. Unfortunately, there is no debug info available for this part. 
Can you please open a JIRA to add some debug information on the driver side?

Thanks.

Zhan Zhang

On Oct 9, 2015, at 10:22 AM, patcharee 
<patcharee.thong...@uni.no> wrote:

I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), but according to 
the log there is no ORC pushdown predicate for my query with a WHERE clause:

15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate

I do not understand what is wrong with this.
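
For reference, this is roughly how I run it (simplified; the table 4D and the columns 
are the ones from my query quoted below):

   hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
   val df = hiveContext.sql("select z from 4D where x = 320 and y = 117")
   df.explain(true)  // the plan contains the filter, but the OrcInputFormat debug log still shows no pushdown predicate
   df.show()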

BR,
Patcharee

On 09. okt. 2015 19:10, Zhan Zhang wrote:
In your case, you manually set an AND pushdown, and the predicate is right 
based on your setting: leaf-0 = (EQUALS x 320)

The right way is to enable the predicate pushdown as follows.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
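
For example, with that flag set (a minimal sketch; the warehouse path and the filter 
value are just the ones mentioned elsewhere in this thread):

   val df = sqlContext.read.format("orc").load("/apps/hive/warehouse/table1")
   df.filter("x = 320").show()

Spark SQL should then build the ORC SearchArgument from the filter itself, so there 
is no need to set sarg.pushdown or hive.io.file.readcolumn.names by hand.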

Thanks.

Zhan Zhang

On Oct 9, 2015, at 9:58 AM, patcharee 
<patcharee.thong...@uni.no> wrote:

Hi Zhan Zhang

Actually my query has a WHERE clause: "select date, month, year, hh, (u*0.9122461 
- v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D where x = 320 and y 
= 117 and zone == 2 and year=2009 and z >= 2 and z <= 8". Columns "x" and "y" are 
not partition columns; the others are partition columns. I expected the system to 
use predicate pushdown, but I turned on debug logging and found that no pushdown 
predicate was generated ("DEBUG OrcInputFormat: No ORC pushdown predicate").

Then I tried to set the search argument explicitly (on column "x", which is not a 
partition column):

   // Build a SearchArgument equivalent to "x = 320" and hand it to the ORC reader directly
   val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build()
   hiveContext.setConf("hive.io.file.readcolumn.names", "x")
   hiveContext.setConf("sarg.pushdown", xs.toKryo())

This time the pushdown predicate was generated (see the log below), but the results 
were wrong (no results at all):

15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS 
x 320)
expr = leaf-0

Any idea what is wrong with this? Why is the ORC pushdown predicate not applied 
by the system?

BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:
Hi Patcharee,

From the query, it looks like only column pruning will be applied. Partition 
pruning and predicate pushdown have no effect. Do you see a big IO difference 
between the two methods?

The potential reason for the speed difference that I can think of is the different 
versions of OrcInputFormat. The Hive path may use NewOrcInputFormat, while the 
Spark path uses OrcInputFormat.

Thanks.

Zhan Zhang

On Oct 8, 2015, at 11:55 PM, patcharee 
<patcharee.thong...@uni.no> wrote:

Yes, the predicate pushdown is enabled, but it still takes longer than the first 
method.

BR,
Patcharee

On 08. okt. 2015 18:43, Zhan Zhang wrote:
Hi Patcharee,

Did you enable the predicate pushdown in the second method?

Thanks.

Zhan Zhang

On Oct 8, 2015, at 1:43 AM, patcharee 
<patcharee.thong...@uni.no> wrote:

Hi,

I am using Spark SQL 1.5 to query a Hive table stored as partitioned ORC files. 
We have about 6000 files in total, and each file is about 245MB.

What is the difference between these two query methods below:

1. Using query on hive table directly

hiveContext.sql("select col1, col2 from table1")

2. Reading from the ORC files, registering a temp table, and querying the temp table

val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
c.registerTempTable("regTable")
hiveContext.sql("select col1, col2 from regTable")

When the number of files is large (querying all 6000 files), the second case is 
much slower than the first one. Any ideas why?

BR,



