> I have tried a bloom filter, but it makes no improvement. I know about
> tez, but have never used it; I will try it later.
...
> select count(*) from gprs where terminal_type=25080;
> will not scan data
> Time taken: 353.345 seconds
CombineInputFormat does not do any split-elimination, so MapReduce does
not get container speedups there.
Most of your ~300s looks to be the fixed overheads of setting up each task.
We could not fix this in MRv2 due to historical compatibility issues with
merge-joins & schema evolution (see HiveSplitGenerator.java).
This is not recommended for regular use (other than in Tez), but you can
force split-elimination with
set hive.input.format=${hive.tez.input.format};
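As a rough sketch of what that session could look like (the
hive.optimize.index.filter line is an assumption on my part about how
your session is configured; it is the setting that lets ORC use its
row-group indexes and bloom filters for the predicate, and the query is
the one from your mail):

```sql
-- Assumption: turn on ORC index/bloom-filter predicate pushdown for this session.
set hive.optimize.index.filter=true;
-- From the advice above: reuse the Tez input format so splits can be eliminated.
set hive.input.format=${hive.tez.input.format};
-- The query from the original post, unchanged.
select count(*) from gprs where terminal_type=25080;
```

If the elapsed time barely moves with these set, the cost is probably
still in task setup rather than in reading data.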
>>>> So, has anyone used ORC's built-in indexes before (especially in
>>>> spark SQL)? What's my issue?
We work on SparkSQL perf issues as well - this has to do with OrcRelation
https://github.com/apache/spark/pull/10938
+
https://github.com/apache/spark/pull/10842
Cheers,
Gopal