[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741589#action_12741589 ]
Dmitriy V. Ryaboy commented on PIG-845: --------------------------------------- Some Comments below. It's a big patch, so a lot of comments... 1. EndOfAllInput flags -- could you add comments here about what the point of this flag is? You explain what EndOfAllInputSetter does (which is actually rather self-explanatory) but not what the meaning of the flag is and how it's used. There is a bit of an explanation in PigMapBase, but it really belongs here. 2. Could you explain the relationship between EndOfAllInput and (deleted) POStream? 3. Comments in MRCompiler alternate between referring to the left MROp as LeftMROper and curMROper. Choose one. 4. I am curious about the decision to throw compiler exceptions if MergeJoin requirements re number of inputs, etc, aren't satisfied. It seems like a better user experience would be to log a warning and fall back to a regular join. 5. Style notes for visitMergeJoin: It's a 200-line method. Any way you can break it up into smaller components? As is, it's hard to follow. The if statements should be broken up into multiple lines to agree with the style guides. Variable naming: you've got topPrj, prj, pkg, lr, ce, nig.. one at a time they are fine, but together in a 200-line method they are undreadable. Please consider more descriptive names. 6. Kind of a global comment, since it applies to more than just MergeJoin: It seems to me like we need a Builder for operators to clean up some of the new, set, set, set stuff. Having the setters return this and a Plan's add() method return the plan, would let us replace this: POProject topPrj = new POProject(new OperatorKey(scope,nig.getNextNodeId(scope))); topPrj.setColumn(1); topPrj.setResultType(DataType.TUPLE); topPrj.setOverloaded(true); rightMROpr.reducePlan.add(topPrj); rightMROpr.reducePlan.connect(pkg, topPrj); with this: POProject topPrj = new POProject(new OperatorKey(scope,nig.getNextNodeId(scope))) .setColumn(1).setResultType(DataType.TUPLE) .setOverloaded(true); rightMROpr.reducePlan.add(topPrj).connect(pkg, topPrj) 7. Is the change to List<List<Byte>> keyTypes in POFRJoin related to MergeJoin or just rolled in? 8. MergeJoin break getNext() into components. I don't see you supporting Left outer joins. Plans for that? At least document the planned approach. Error codes being declared deep inside classes, and documented on the wiki, is a poor practice, imo. They should be pulled out into PigErrors (as lightweight final objects that have an error code, a name, and a description..) I thought Santhosh made progress on this already, no? Could you explain the problem with splits and streams? Why can't this work for them? 9. Sampler/Indexer: 9a Looks like you create the same number of map tasks for this as you do for a join; all a sampling map task does is read one record and emit a single tuple. That seems wasteful; there is a lot of overhead in setting up these tiny jobs which might get stuck behind other jobs running on the cluster, etc. If the underlying file has syncpoints, a smaller number of MR tasks can be created. If we know the ratio of sample tasks to "full" tasks, we can figure out how many records we should emit per job ( ceil(full_tasks/sample_tasks) ). We can approximately achieve this through seeking trough (end-offset)/num_to_emit and doing a sync() after that seek. It's approximate, but close enough for an index. 9b Consider renaming to something like SortedFileIndexer, since it's coneivable that this component can be reused in a context other than a Merge Join. 10. Would it make sense to expose this to the users via a 'CREATE INDEX' (or similar) command? That way the index could be persisted, and the user could tell you to use an existing index instead of rescanning the data. 11. I am not sure about the approach of pushing sampling above filters. Have you guys benchmarked this? Seems like you'd wind up reading the whole file in the sample job if the filter is selective enough (and high filter selectivity would also make materialize->sample go much faster). Testing: 12a You should test for refusal to do 3-way join and other error condition (or a warning and successful failover to regular join -- my preference) 12b You should do a proper unit test for the MergeJoinIndexer (or whatever we are calling it). > PERFORMANCE: Merge Join > ----------------------- > > Key: PIG-845 > URL: https://issues.apache.org/jira/browse/PIG-845 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Ashutosh Chauhan > Attachments: merge-join-1.patch, merge-join-for-review.patch > > > Thsi join would work if the data for both tables is sorted on the join key. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.