Hi all,

I have some trouble understanding how Pig's JOIN operator differs from
Hive's.

As noted in some of my previous mails, I ran benchmarks and found Pig to
outperform Hive when it comes to joins: Hive's join operations were
significantly slower than Pig's. My earlier results were confirmed by
additional benchmarks, including two transitive self-join experiments on
datasets of 1,000 and 10,000,000 records. On the smaller dataset, Pig took
an average of 45.36 seconds (wall-clock time) to Hive's 56.73 seconds; on
the larger one, Pig averaged 157.97 seconds and Hive 180.19 seconds.


Could Hive's join being less efficient than Pig's be explained by Hive's
join operator implementation,
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.java? In essence, the
methods responsible for the join in Hive recursively produce arrays of bit
vectors, where each entry denotes whether an element is to be used in the
join (if the element is null, it won't be used). Not only does the
recursion result in a large memory footprint, but it also means that an
unnecessary number of objects is created for each join operation, and since
object creation in Java is expensive, this makes the whole join a lot less
efficient than Pig's.
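To make the idea concrete, here is a minimal sketch of the bit-vector
pattern I mean, not Hive's actual code: each side of the join builds a bit
vector marking which rows have a non-null key, and rows with a cleared bit
are skipped. The class and method names are mine, purely for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch only (hypothetical names, not Hive's implementation):
// a bit vector records, per row, whether the join key is non-null and thus
// whether that row participates in the join.
public class BitVectorJoinSketch {

    // Bit i is set iff keys[i] is non-null, i.e. row i joins.
    static BitSet participation(Object[] keys) {
        BitSet bits = new BitSet(keys.length);
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] != null) bits.set(i);
        }
        return bits;
    }

    // Naive inner join on equal keys; null-keyed rows on either side
    // are filtered out via the bit vectors before comparing.
    static List<int[]> join(Object[] left, Object[] right) {
        BitSet l = participation(left);
        BitSet r = participation(right);
        List<int[]> pairs = new ArrayList<>();
        for (int i = l.nextSetBit(0); i >= 0; i = l.nextSetBit(i + 1)) {
            for (int j = r.nextSetBit(0); j >= 0; j = r.nextSetBit(j + 1)) {
                if (left[i].equals(right[j])) pairs.add(new int[]{i, j});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Object[] left  = {"a", null, "b"};
        Object[] right = {"b", "a", null};
        // Matches: left[0]="a" with right[1]="a", left[2]="b" with right[0]="b"
        System.out.println(join(left, right).size()); // prints 2
    }
}
```

The sketch builds one BitSet per invocation; my concern about the real
operator is that doing this recursively, allocating fresh arrays of bit
vectors at each step, multiplies such allocations per joined row.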

Or am I missing the point entirely, and Pig uses a fundamentally different
approach to joining data?

Thanks,
Ben
