I’ve been trying to benchmark some of the Hive enhancements in Hadoop 2.0 using the HDP Sandbox.
I took one of their example queries and ran it with the tables stored as TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized execution and predicate pushdown.

SELECT s07.description, s07.salary, s08.salary, s08.salary - s07.salary
FROM sample_07 s07
JOIN sample_08 s08 ON (s07.code = s08.code)
WHERE s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC

Ultimately there was not much difference in performance between any of the runs. Can someone clarify whether I need an actual full cluster to see performance improvements, or whether I’m missing something else? I thought that at a minimum I would have seen an improvement moving from TEXTFILE to ORC.
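
For reference, here is a minimal sketch of the kind of setup described above. The ORC table names (sample_07_orc, sample_08_orc) are hypothetical and only for illustration, and the settings shown are the standard Hive properties for vectorization and predicate pushdown, not necessarily the exact configuration used in the sandbox.

```sql
-- Sketch only: sample_07_orc and sample_08_orc are hypothetical table names.
-- Copy the sandbox sample tables into ORC-backed tables.
CREATE TABLE sample_07_orc STORED AS ORC AS SELECT * FROM sample_07;
CREATE TABLE sample_08_orc STORED AS ORC AS SELECT * FROM sample_08;

-- Enable vectorized execution (rows are processed in batches;
-- this was originally supported only for ORC data).
SET hive.vectorized.execution.enabled = true;

-- Enable predicate pushdown, including pushing filters into the ORC reader.
SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true;

-- Same query as above, run against the ORC copies.
SELECT s07.description, s07.salary, s08.salary, s08.salary - s07.salary
FROM sample_07_orc s07
JOIN sample_08_orc s08 ON (s07.code = s08.code)
WHERE s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC;
```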