[jira] Commented: (PIG-845) PERFORMANCE: Merge Join

Dmitriy V. Ryaboy (JIRA) Mon, 10 Aug 2009 15:11:41 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741589#action_12741589
 ]


Dmitriy V. Ryaboy commented on PIG-845:
---------------------------------------

Some Comments below.
It's a big patch, so a lot of comments...

1. 
EndOfAllInput flags -- could you add comments here about what the point of this 
flag is? You explain what EndOfAllInputSetter does (which is actually rather 
self-explanatory) but not what the meaning of the flag is and how it's used. 
There is a bit of an explanation in PigMapBase, but it really belongs here.

2.
Could you explain the relationship between EndOfAllInput and (deleted) POStream?

3.
Comments in MRCompiler alternate between referring to the left MROp as 
LeftMROper and curMROper. Choose one.

4.
I am curious about the decision to throw compiler exceptions if MergeJoin 
requirements re number of inputs, etc, aren't satisfied. It seems like a better 
user experience would be to log a warning and fall back to a regular join.

5.
Style notes for visitMergeJoin: 

It's a 200-line method. Any way you can break it up into smaller components? As 
is, it's hard to follow.

The if statements should be broken up into multiple lines to agree with the 
style guides.

Variable naming: you've got topPrj, prj, pkg, lr, ce, nig.. one at a time they 
are fine, but together in a 200-line method they are undreadable. Please 
consider more descriptive names.

6.
Kind of a global comment, since it applies to more than just MergeJoin:

It seems to me like we need a Builder for operators to clean up some of the 
new, set, set, set stuff.

Having the setters return this and a Plan's add() method return the plan, would 
let us replace this:

POProject topPrj = new POProject(new 
OperatorKey(scope,nig.getNextNodeId(scope)));
topPrj.setColumn(1);
topPrj.setResultType(DataType.TUPLE);
topPrj.setOverloaded(true);
rightMROpr.reducePlan.add(topPrj);
rightMROpr.reducePlan.connect(pkg, topPrj);

with this:

POProject topPrj = new POProject(new 
OperatorKey(scope,nig.getNextNodeId(scope)))
        .setColumn(1).setResultType(DataType.TUPLE)
        .setOverloaded(true);

rightMROpr.reducePlan.add(topPrj).connect(pkg, topPrj)


7.
Is the change to List<List<Byte>> keyTypes in POFRJoin related to MergeJoin or 
just rolled in?

8. MergeJoin

break getNext() into components.

I don't see you supporting Left outer joins. Plans for that? At least document 
the planned approach.

Error codes being declared deep inside classes, and documented on the wiki, is 
a poor practice, imo. They should be pulled out into PigErrors (as lightweight 
final objects that have an error code, a name, and a description..) I thought 
Santhosh made progress on this already, no?

Could you explain the problem with splits and streams? Why can't this work for 
them?


9. Sampler/Indexer:
9a
Looks like you create the same number of map tasks for this as you do for a 
join; all a sampling map task does is read one record and emit a single tuple.  
That seems wasteful; there is a lot of overhead in setting up these tiny jobs 
which might get stuck behind other jobs running on the cluster, etc. If the 
underlying file has syncpoints, a smaller number of MR tasks can be created. If 
we know the ratio of sample tasks to "full" tasks, we can figure out how many 
records we should emit per job ( ceil(full_tasks/sample_tasks) ).  We can 
approximately achieve this through seeking trough (end-offset)/num_to_emit and 
doing a sync() after that seek. It's approximate, but close enough for an index.

9b
Consider renaming to something like SortedFileIndexer, since it's coneivable 
that this component can be reused in a context other than a Merge Join.

10.
Would it make sense to expose this to the users via a 'CREATE INDEX' (or 
similar) command?
That way the index could be persisted, and the user could tell you to use an 
existing index instead of rescanning the data.

11.
I am not sure about the approach of pushing sampling above filters. Have you 
guys benchmarked this? Seems like you'd wind up reading the whole file in the 
sample job if the filter is selective enough (and high filter selectivity would 
also make materialize->sample go much faster).

Testing: 
12a
You should test for refusal to do 3-way join and other error condition (or a 
warning and successful failover to regular join -- my preference)

12b
You should do a proper unit test for the MergeJoinIndexer (or whatever we are 
calling it).



> PERFORMANCE: Merge Join
> -----------------------
>
>                 Key: PIG-845
>                 URL: https://issues.apache.org/jira/browse/PIG-845
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Ashutosh Chauhan
>         Attachments: merge-join-1.patch, merge-join-for-review.patch
>
>
> Thsi join would work if the data for both tables is sorted on the join key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-845) PERFORMANCE: Merge Join

Reply via email to