[ https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118549#comment-13118549 ]
Aaron Klish commented on PIG-2293:
----------------------------------

Try this in local mode (This is not an infinite loop - I must have been mistaken about that).

[klish@gwgd4008 pig_patch]$ cat a.txt
1 2 3
4 2 1
4 3 3
7 2 5
8 4 3
8 3 4
12 3 4
20 1 2
28 4 1
[klish@gwgd4008 pig_patch]$ cat b.txt
1 3
2 7
2 9
2 4
4 6
4 9
8 9
[klish@gwgd4008 pig_patch]$ cat index_join.pig
A = LOAD './a.txt' AS (a1:int, a2:int, a3:int);
B = LOAD './b.txt' AS (b1:int, b2:int);
C = join A by a1, B by b1 USING 'merge-sparse';
DUMP C;

I get the following exception:

org.apache.pig.backend.executionengine.ExecException: ERROR 1102: Data is not sorted on right side. Last two keys encountered were: 2 1
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:359)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

> Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2293
>                 URL: https://issues.apache.org/jira/browse/PIG-2293
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Aaron Klish
>            Assignee: Aaron Klish
>             Fix For: 0.10
>
>         Attachments: PIG-2293-1.patch, PIG-2293-2.patch, PIG-2293-3.patch, e2e_test.txt, patch.txt, patch.txt
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The existing Pig merge join has the following limitations:
> 1. It assumes the right side of the join must be accessed sequentially, record by record.
> 2. It does not perform well against large, sparse tables.
>
> The current implementation of the merge join introduced the interface IndexableLoadFunc. This 'LoadFunc' supports the ability to 'seekNear' a given key (before reading the next record).
> The merge join physical operator only calls 'seekNear' for the first key in each split (effectively eliminating splits where the first and subsequent keys will not be found). Subsequent matches are found by reading sequentially through the records of the right table, looking for matches from the left table.
> While this method works well for dense join tables, it performs poorly against large, sparse tables or data sources that support point lookups natively (HBase, for example).
>
> The proposed enhancement is to add a new join type, 'merge-sparse', to Pig Latin. When specified in the Pig script, this join type will cause the merge join operator to call seekNear on each and every key (rather than just the first in each split).
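
To make the difference between the two strategies concrete, here is a minimal, self-contained Java sketch. It is not the code from the attached patches and does not use Pig's real IndexableLoadFunc or POMergeJoin; the class SparseMergeJoinSketch, the nested SortedSource, and the recordsRead counter are hypothetical names invented for illustration. It assumes both sides are sorted by join key and that seekNear positions the reader at the first record whose key is greater than or equal to the requested key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a toy model of "seek once, then scan" versus "seek on every key".
final class SparseMergeJoinSketch {

    /** Hypothetical sorted right-side source; a stand-in, not Pig's IndexableLoadFunc. */
    static final class SortedSource {
        private final int[][] rows; // each row is {key, value}, sorted ascending by key
        private int pos = 0;        // cursor into rows
        int recordsRead = 0;        // number of rows returned by next()

        SortedSource(int[][] rowsSortedByKey) {
            this.rows = rowsSortedByKey;
        }

        /** Assumed semantics: move the cursor to the first row whose key is >= key. */
        void seekNear(int key) {
            int lo = 0;
            int hi = rows.length;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (rows[mid][0] < key) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            pos = lo;
        }

        boolean hasNext() {
            return pos < rows.length;
        }

        int peekKey() {
            return rows[pos][0];
        }

        int[] next() {
            recordsRead++;
            return rows[pos++];
        }
    }

    /**
     * Joins sorted left keys against the right source and returns {key, rightValue} pairs.
     * seekEveryKey = false seeks only for the first left key and then scans sequentially;
     * seekEveryKey = true seeks before every distinct left key.
     */
    static List<int[]> join(int[] leftKeys, SortedSource right, boolean seekEveryKey) {
        List<int[]> out = new ArrayList<>();
        List<Integer> matches = Collections.emptyList();
        boolean first = true;
        int lastKey = 0;
        for (int key : leftKeys) {
            if (first || key != lastKey) {
                if (first || seekEveryKey) {
                    right.seekNear(key);
                }
                // Without a fresh seek, non-matching rows are read and discarded one by one.
                while (right.hasNext() && right.peekKey() < key) {
                    right.next();
                }
                matches = new ArrayList<>();
                while (right.hasNext() && right.peekKey() == key) {
                    matches.add(right.next()[1]);
                }
                first = false;
                lastKey = key;
            }
            // A repeated left key reuses the matches collected for its first occurrence.
            for (int value : matches) {
                out.add(new int[] {key, value});
            }
        }
        return out;
    }

    /** Builds a right side with keys 0..n-1; the value column is arbitrary. */
    static int[][] denseRows(int n) {
        int[][] rows = new int[n][];
        for (int k = 0; k < n; k++) {
            rows[k] = new int[] {k, k * 2};
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] leftKeys = {10, 5000, 9990}; // few left keys against a large right side
        SortedSource sequential = new SortedSource(denseRows(10000));
        SortedSource sparse = new SortedSource(denseRows(10000));
        join(leftKeys, sequential, false);
        join(leftKeys, sparse, true);
        System.out.println("seek once, then scan: " + sequential.recordsRead + " records read");
        System.out.println("seek on every key:    " + sparse.recordsRead + " records read");
    }
}
```

Both modes return the same join result; what changes is how many right-side records are read to produce it. Whether seeking per key wins in practice depends on the cost of a seek in the underlying source, which this toy counter does not model.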