Hi Professor Gopal,

> Most of your ~300s looks to be the fixed overheads of setting up each task.

Maybe you are right. Perhaps the ORC indexes do work normally in Hive, and it is just that the fixed overhead is so large that the performance improvement is not obvious. I will check this later.
But I care more about Spark SQL: does it support ORC indexes completely? I use the shell script "${SPARK_HOME}/bin/spark-sql" to launch the Spark SQL REPL and run my query statements. The following is my test in the Spark SQL REPL:

spark-sql> set spark.sql.orc.filterPushdown=true;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

Neither of these two queries should scan the whole data set (if the file stats were used), so why is the time gap so large?

spark-sql> set spark.sql.orc.filterPushdown=false;
spark-sql> select count(*) from gprs where terminal_type=25080;
Time taken: about 5 seconds
spark-sql> select * from gprs where terminal_type=25080;
Time taken: about 107 seconds

So when I disabled spark.sql.orc.filterPushdown, there was no difference in the time taken (I mean for the select * from ... query) compared with having it enabled.

I have tried the "explain extended" command, but it did not show any information indicating that the query statement had used the ORC stats. Is there any way to check whether the stats are being used?

Joseph

From: Gopal Vijayaraghavan
Date: 2016-03-16 22:18
To: user@hive.apache.org
CC: Joseph
Subject: Re: The build-in indexes in ORC file does not work.

> I have tried bloom filter, but it makes no improvement. I know about
> tez, but never used it; I will try it later.
...
> select count(*) from gprs where terminal_type=25080;
> will not scan data
> Time taken: 353.345 seconds

CombineInputFormat does not do any split-elimination, so MapReduce does not get container speedups there. Most of your ~300s looks to be the fixed overheads of setting up each task.

We could not fix this in MRv2 due to historical compatibility issues with merge-joins & schema evolution (see HiveSplitGenerator.java).
This is not recommended for regular use (other than in Tez), but you can force split-elimination with

set hive.input.format=${hive.tez.input.format};

>>>> So, has anyone used ORC's built-in indexes before (especially in
>>>> Spark SQL)? What's my issue?

We work on SparkSQL perf issues as well - this has to do with OrcRelation:

https://github.com/apache/spark/pull/10938
https://github.com/apache/spark/pull/10842

Cheers,
Gopal
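[Editor's sketch, not part of the original thread.] One rough way to answer Joseph's question ("is there any way to check the use of stats?") is to inspect the physical plan. This is a hedged sketch: the exact plan text varies by Spark version (only some versions print a PushedFilters entry on the scan node), and the table and column names are simply taken from the queries above.

```sql
-- Enable ORC predicate pushdown, then inspect the physical plan.
SET spark.sql.orc.filterPushdown=true;

-- If pushdown is wired up, the scan node in the plan should mention the
-- predicate (e.g. a "PushedFilters: [EqualTo(terminal_type,25080)]" entry
-- in newer Spark versions). If the predicate only appears in a Filter
-- operator sitting above the scan, the ORC reader is still reading every
-- row and the file/stripe stats are not being used for elimination.
EXPLAIN EXTENDED SELECT * FROM gprs WHERE terminal_type = 25080;
```

Note that even when the predicate is pushed, `select *` must materialize every column of the matching rows, while `count(*)` can be answered almost entirely from metadata, so some gap between the two timings is expected regardless of pushdown.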