I’ve been trying to benchmark some of the Hive enhancements in Hadoop 2.0 using 
the HDP Sandbox. 

I took one of their example queries and executed it with the tables stored as 
TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized execution and 
predicate pushdown.

SELECT s07.description, s07.salary, s08.salary,
  s08.salary - s07.salary
FROM
  sample_07 s07 JOIN sample_08 s08
  ON (s07.code = s08.code)
WHERE
  s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC
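
For anyone trying to reproduce this, the storage formats and settings mentioned 
above are typically set up along the following lines. This is only a sketch: the 
table names sample_07_orc and sample_08_orc are hypothetical, and whether each 
setting has any effect depends on the Hive version shipped with the Sandbox 
(vectorized execution, for example, is a Hive 0.13 feature and works on 
ORC-backed tables).

```sql
-- Hypothetical table names; adjust to your own schema.
-- Copy the existing text-backed sample tables into ORC-backed ones.
CREATE TABLE sample_07_orc STORED AS ORC AS SELECT * FROM sample_07;
CREATE TABLE sample_08_orc STORED AS ORC AS SELECT * FROM sample_08;

-- Vectorized execution (processes rows in batches; needs ORC input).
SET hive.vectorized.execution.enabled = true;

-- Predicate pushdown in the optimizer, plus pushing filters
-- down into the ORC reader so it can skip data using its indexes.
SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true;
```

The query is then run against the _orc tables instead of the originals, with 
the same SET statements issued in the same session.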

Ultimately, there was not much difference in performance across any of the 
runs. Can someone clarify whether I need an actual full cluster to see 
performance improvements, or whether I’m missing something else? At a minimum, 
I expected to see an improvement moving from TEXTFILE to ORC.
