Hi Professor Gopal,

> Most of your ~300s looks to be the fixed overheads of setting up each task.

Maybe you are right. Perhaps the ORC indexes do work normally in Hive, and it is just that the fixed overhead is so large that the performance improvement is not obvious. I will check this later.
But I care more about Spark SQL: does it support ORC indexes completely? I use the shell script "${SPARK_HOME}/bin/spark-sql" to launch the Spark SQL REPL and run my query statements. The following is my test in the Spark SQL REPL:

spark-sql> set spark.sql.orc.filterPushdown=true;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

Neither of these two queries should scan the whole data set (if the file stats were used), so why is the time gap so large?

spark-sql> set spark.sql.orc.filterPushdown=false;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

So when I disabled spark.sql.orc.filterPushdown, there was no difference in the time taken (I mean for the select * from ... query) compared with having it enabled.

I have tried the "explain extended" command, but it did not show any information indicating that the query statement had used the ORC stats. Is there any way to check whether the stats are being used?

Joseph

From: Gopal Vijayaraghavan
Date: 2016-03-16 22:18
To: user@hive.apache.org
CC: Joseph
Subject: Re: The build-in indexes in ORC file does not work.

> I have tried bloom filter, but it makes no improvement. I know about
> tez, but never used it; I will try it later.
...
> select count(*) from gprs where terminal_type=25080;
> will not scan data
> Time taken: 353.345 seconds

CombineInputFormat does not do any split-elimination, so MapReduce does not get container speedups there. Most of your ~300s looks to be the fixed overheads of setting up each task.

We could not fix this in MRv2 due to historical compatibility issues with merge-joins & schema evolution (see HiveSplitGenerator.java).
This is not recommended for regular use (other than in Tez), but you can force split-elimination with

set hive.input.format=${hive.tez.input.format};

>>>> So, has anyone used ORC's built-in indexes before (especially in
>>>> Spark SQL)? What's my issue?

We work on SparkSQL perf issues as well - this has to do with OrcRelation:

https://github.com/apache/spark/pull/10938
https://github.com/apache/spark/pull/10842

Cheers,
Gopal
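[Editor's sketch, not part of the original thread.] One rough way to answer Joseph's question ("is there any way to check the use of stats?") is to inspect the physical plan. This is a hedged sketch: the exact plan text varies by Spark version (only some versions print a PushedFilters entry on the scan node), and the table and column names are simply taken from the queries above.

```sql
-- Enable ORC predicate pushdown, then inspect the physical plan.
SET spark.sql.orc.filterPushdown=true;

-- If pushdown is wired up, the scan node in the plan should mention the
-- predicate (e.g. a "PushedFilters: [EqualTo(terminal_type,25080)]" entry
-- in newer Spark versions). If the predicate only appears in a Filter
-- operator sitting above the scan, the ORC reader is still reading every
-- row and the file/stripe stats are not being used for elimination.
EXPLAIN EXTENDED SELECT * FROM gprs WHERE terminal_type = 25080;
```

Note that even when the predicate is pushed, `select *` must materialize every column of the matching rows, while `count(*)` can be answered almost entirely from metadata, so some gap between the two timings is expected regardless of pushdown.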