[jira] [Commented] (PIG-2293) Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.

Aaron Klish (Commented) (JIRA) Fri, 30 Sep 2011 16:02:11 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118549#comment-13118549
 ]


Aaron Klish commented on PIG-2293:
----------------------------------

Try this in local mode (This is not an infinite loop - I must have been 
mistaken about that).

[klish@gwgd4008 pig_patch]$ cat a.txt
1       2       3
4       2       1
4       3       3
7       2       5
8       4       3
8       3       4
12      3       4
20      1       2
28      4       1
[klish@gwgd4008 pig_patch]$ cat b.txt
1       3
2       7
2       9
2       4
4       6
4       9
8       9
[klish@gwgd4008 pig_patch]$ cat index_join.pig
A = LOAD './a.txt'  AS (a1:int, a2:int, a3:int);
B = LOAD './b.txt'  AS (b1:int, b2:int);

C = join A by a1, B by b1 USING 'merge-sparse';

DUMP C;

I get the following exception:

org.apache.pig.backend.executionengine.ExecException: ERROR 1102: Data is not 
sorted on right side. Last two keys encountered were:
2
1
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:359)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

                
> Pig should support a more efficient merge join against data sources that 
> natively support point lookups or where the join is against large, sparse 
> tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2293
>                 URL: https://issues.apache.org/jira/browse/PIG-2293
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Aaron Klish
>            Assignee: Aaron Klish
>             Fix For: 0.10
>
>         Attachments: PIG-2293-1.patch, PIG-2293-2.patch, PIG-2293-3.patch, 
> e2e_test.txt, patch.txt, patch.txt
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The existing PIG merge join has the following limitations:
>    1. It assumes the right side of the table must be accessed sequentially - 
> record by record.
>    2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface 
> IndexableLoadFunc.  This 'LoadFunc'
> supports the ability to 'seekNear' a given key (before reading the next 
> record).  
> The merge join physical operator only calls 'seekNear' for the first key in 
> each split (effectively eliminating splits
> where the first and subsequent keys will not be found).  Subsequent joins are 
> found by reading sequentially through
> the records on the right table looking for matches from the left table.
> While this method works well for dense join tables - it performs poorly 
> against large sparse tables or data sources that support 
> point lookups natively (HBase for example).
> The proposed enhancement is to add a new join type - 'merge-sparse' to PIG 
> latin.  When specified in the PIG script, this join type
> will cause the merge join operator to call seekNear on each and every key 
> (rather than just the first in each split).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2293) Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.

Reply via email to