Hi all, I have some trouble understanding how Pig's JOIN operator differs from Hive's.
As noted in some of my previous mails, I ran benchmarks and found Pig to outperform Hive when it comes to joins: Hive's join operations are significantly slower than Pig's. My earlier results were confirmed by additional benchmarks, including two transitive self-join experiments on datasets of 1,000 and 10,000,000 records. On the former, Pig took an average of 45.36 seconds (real-time runtime) versus Hive's 56.73 seconds; on the latter, Pig took 157.97 seconds and Hive 180.19 seconds (again, on average).

Could the fact that Hive's join implementation is less efficient than Pig's be explained by examining Hive's join operator implementation, org.apache.hadoop.hive.ql.exec.CommonJoinOperator.java? In essence, the methods responsible for the join in Hive recursively produce arrays of bit vectors, where each entry in a bit vector denotes whether an element is to be used in the join (if the element is null, it won't be used). Not only does the recursion result in a large memory footprint, it also means that an unnecessary number of objects is created for each join operation; object creation in Java is expensive, which would make the entire join a lot less efficient than Pig's.

Or am I completely missing the point, and does Pig use a completely different approach to joining data?

Thanks,
Ben
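P.S. To make the pattern I mean concrete, here is a minimal, self-contained sketch. This is my own illustrative code, not Hive's actual CommonJoinOperator logic, and all names in it (usableColumns, selfJoinOnFirstColumn, etc.) are mine; it only mimics the idea of allocating a fresh bit vector per row to mark which values are non-null and therefore join-eligible, which is where I suspect the object-creation overhead comes from:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- NOT Hive's actual code. It mimics the
// pattern described above: for every row visited, a fresh bit vector
// (boolean[]) is allocated marking which columns are non-null and
// therefore usable in the join. One allocation per row per pass is
// the object-creation cost in question.
public class BitVectorJoinSketch {

    // Build a "null bit vector" for one row: true = column is non-null.
    static boolean[] usableColumns(Object[] row) {
        boolean[] bits = new boolean[row.length]; // new object per row, per call
        for (int i = 0; i < row.length; i++) {
            bits[i] = (row[i] != null);
        }
        return bits;
    }

    // Naive inner self-join on column 0, skipping rows whose join key is null.
    static List<Object[]> selfJoinOnFirstColumn(List<Object[]> rows) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] left : rows) {
            boolean[] leftBits = usableColumns(left);   // allocation
            if (!leftBits[0]) continue;                 // null key: row not used
            for (Object[] right : rows) {
                boolean[] rightBits = usableColumns(right); // allocation again
                if (!rightBits[0]) continue;
                if (left[0].equals(right[0])) {
                    out.add(new Object[] { left[0], left[1], right[1] });
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] { "a", 1 });
        rows.add(new Object[] { "a", 2 });
        rows.add(new Object[] { null, 3 }); // dropped: null join key
        // The two "a" rows each join with both "a" rows: 2 x 2 = 4 results.
        System.out.println(selfJoinOnFirstColumn(rows).size());
    }
}
```

Even in this toy version, every row triggers a fresh boolean[] allocation on every pass of the inner loop, so the allocation count grows with the product of the input sizes rather than their sum.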