[ https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120268#comment-13120268 ]
Aaron Klish commented on PIG-2293:
----------------------------------
Hi Daniel,
Unfortunately, there are still problems with this patch. Here is an infinite
loop scenario:
Make a simple file called x:
for ((i=0;i<1000;i++)); do echo $i; done > x
Make another file called z:
[klish@gwgd4007 pig_patch]$ cat z
1
999
And a simple script like:
A = LOAD './z' AS (a1:int);
B = LOAD './x' AS (b1:int);
C = join A by a1, B by b1 USING 'merge-sparse';
DUMP C;
As written, DefaultIndexableLoader creates a new ReadToEndLoader on every
call to seekNear. The class was never designed for multiple calls to
seekNear. Short of some refactoring of this class, I don't see how allowing
repeated calls is a good idea. This is the line in question:
loader = new ReadToEndLoader(
    (LoadFunc) PigContext.instantiateFuncFromSpec(rightLoaderFuncSpec),
    conf, inpLocation, splitsToBeRead);
> Pig should support a more efficient merge join against data sources that
> natively support point lookups or where the join is against large, sparse
> tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PIG-2293
> URL: https://issues.apache.org/jira/browse/PIG-2293
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Aaron Klish
> Assignee: Aaron Klish
> Fix For: 0.10
>
> Attachments: PIG-2293-1.patch, PIG-2293-2.patch, PIG-2293-3.patch,
> PIG-2293-4.patch, e2e_test.txt, patch.txt, patch.txt
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> The existing Pig merge join has the following limitations:
> 1. It assumes the right-side table must be accessed sequentially,
> record by record.
> 2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface
> IndexableLoadFunc. This 'LoadFunc'
> supports the ability to 'seekNear' a given key (before reading the next
> record).
> The merge join physical operator only calls 'seekNear' for the first key in
> each split (effectively eliminating splits
> where the first and subsequent keys will not be found). Subsequent matches
> are found by reading sequentially through
> the records of the right table, looking for matches from the left table.
> While this method works well for dense join tables, it performs poorly
> against large sparse tables or data sources that support
> point lookups natively (HBase, for example).
> The proposed enhancement is to add a new join type, 'merge-sparse', to Pig
> Latin. When specified in the Pig script, this join type
> will cause the merge join operator to call seekNear on each and every key
> (rather than just the first in each split).
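
To make the trade-off described above concrete, here is a rough sketch that
counts how many right-side records each strategy reads for the z/x example
from this comment (left keys 1 and 999, right keys 0..999). It is a toy model
only, not Pig code: the class and method names are hypothetical, the right
table is an in-memory sorted array, and seekNear is approximated with a
binary search.

```java
import java.util.Arrays;

// Toy model, NOT Pig code: all names here are hypothetical.
public class MergeJoinSketch {

    // Sequential strategy: pull right-side records one at a time until the
    // record currently held is >= the left key. Returns the number pulled.
    static int sequentialReads(int[] left, int[] right) {
        int reads = 0;
        int cur = -1; // index of the right-side record currently held
        for (int key : left) {
            while (cur < 0 || right[cur] < key) {
                cur++;
                if (cur >= right.length) {
                    return reads; // right side exhausted
                }
                reads++;
            }
        }
        return reads;
    }

    // Sparse strategy: "seek near" every left key (modeled as a binary
    // search over the sorted array), then pull the one record found there.
    static int sparseReads(int[] left, int[] right) {
        int reads = 0;
        for (int key : left) {
            int idx = Arrays.binarySearch(right, key);
            int pos = idx >= 0 ? idx : -idx - 1; // first record >= key
            if (pos < right.length) {
                reads++;
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        int[] right = new int[1000];
        for (int i = 0; i < right.length; i++) {
            right[i] = i; // same contents as file x
        }
        int[] left = {1, 999}; // same contents as file z
        System.out.println(sequentialReads(left, right)); // prints 1000
        System.out.println(sparseReads(left, right));     // prints 2
    }
}
```

The sketch ignores the cost of each seek, which in a real loader may be far
from free; that cost is exactly why the per-call behavior of seekNear matters.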