[
https://issues.apache.org/jira/browse/CRUNCH-34?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437726#comment-13437726
]
Robert Chu commented on CRUNCH-34:
----------------------------------
Unfortunately the stuff I had previously been working on produces a graph type
that is proving difficult to convert to the graph format used in your patch.
It's probably best to move on without my changes.
> Refactor the MSCRPlanner logic
> ------------------------------
>
> Key: CRUNCH-34
> URL: https://issues.apache.org/jira/browse/CRUNCH-34
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.3.0
> Reporter: Josh Wills
> Assignee: Josh Wills
> Attachments: PLANNER-REFACTORING.patch
>
>
> I had a conversation with Robert a while back about one of the shoddier areas
> of the Crunch codebase: the planning logic. It relies on a whole bunch of
> mutable state, which makes the logic of the overall planning process
> incomprehensible to anyone except for me (back when I wrote it) and Gabriel
> (who grokked it well enough to fix some bugs in it).
> It turns out that understanding the planning process is actually pretty easy
> if you map the logical plan to a graph that has three kinds of vertices:
> Source, Target, and GroupByKey (GBK). All of the other nodes in the logical
> plan (primarily DoCollection/DoTable instances) make up the edges of the
> graph.
> Once you take this graph perspective, you can think of the MapReduce job
> creation process entirely in terms of graph operations:
> 1) Walk the logical plan and construct the initial Graph object, which allows
> Edges to exist between GBK vertices.
> 2) Build a new graph that is identical to the first one, except it eliminates
> Edges between GBK vertices by constructing additional Source and Target
> vertices.
> 3) Identify all of the (weakly) connected components of the new graph.
> 4) Construct MapReduce jobs out of the connected components, either map-only
> jobs when there is no GBK node in the component, or MapReduce jobs when there
> is one (or a fusion job when there is more than one).
> I've been working on this off and on for a couple of weeks, and I have a
> version of the planning code that implements the description above and passes
> all of our tests. There are still places where we have mutable state that
> will need to be cleaned up, but I think this is a step in the right
> direction. I'm not sure it's ready for prime time yet, but I wanted to get
> the conversation started.
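
For readers following along, steps 2 through 4 of the description above can be
sketched roughly as follows. This is not the code in PLANNER-REFACTORING.patch
and does not use any Crunch API: it is a minimal illustration in Python, and
every name in it (split_gbk_edges, weak_components, job_type, plan, the
tmp_target_/tmp_source_ vertex names) is hypothetical. It assumes step 1 has
already produced a graph, given here as a dict mapping vertex name to kind
plus a list of directed (head, tail) edges.

```python
from collections import defaultdict

# Vertex kinds from the issue description. Edges stand in for the
# DoCollection/DoTable nodes of the logical plan.
SOURCE, TARGET, GBK = "source", "target", "gbk"


def split_gbk_edges(kinds, edges):
    """Step 2: replace each GBK -> GBK edge with GBK -> new Target
    and new Source -> GBK, so no edge joins two GBK vertices."""
    new_kinds = dict(kinds)
    new_edges = []
    for i, (head, tail) in enumerate(edges):
        if kinds[head] == GBK and kinds[tail] == GBK:
            # Hypothetical names for the additional vertices.
            tmp_target, tmp_source = f"tmp_target_{i}", f"tmp_source_{i}"
            new_kinds[tmp_target] = TARGET
            new_kinds[tmp_source] = SOURCE
            new_edges.append((head, tmp_target))
            new_edges.append((tmp_source, tail))
        else:
            new_edges.append((head, tail))
    return new_kinds, new_edges


def weak_components(kinds, edges):
    """Step 3: weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for head, tail in edges:
        neighbours[head].add(tail)
        neighbours[tail].add(head)
    seen, components = set(), []
    for start in kinds:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(neighbours[vertex] - component)
        seen |= component
        components.append(component)
    return components


def job_type(kinds, component):
    """Step 4: pick the kind of job from the number of GBK vertices."""
    gbk_count = sum(1 for vertex in component if kinds[vertex] == GBK)
    if gbk_count == 0:
        return "map-only"
    return "map-reduce" if gbk_count == 1 else "fusion"


def plan(kinds, edges):
    """Run steps 2-4 and return a list of (job type, vertex set) pairs."""
    split_kinds, split_edges = split_gbk_edges(kinds, edges)
    return [(job_type(split_kinds, component), component)
            for component in weak_components(split_kinds, split_edges)]


# A hand-built example graph: two chained GBKs plus an unrelated copy.
example_kinds = {"in": SOURCE, "g1": GBK, "g2": GBK, "out": TARGET,
                 "in2": SOURCE, "out2": TARGET}
example_edges = [("in", "g1"), ("g1", "g2"), ("g2", "out"), ("in2", "out2")]
example_jobs = plan(example_kinds, example_edges)
```

In the example, the g1 -> g2 edge is split, so the chained GBKs end up in two
separate components (one MapReduce job each), and the in2 -> out2 copy becomes
a map-only job. A component can still contain more than one GBK after the
split, for instance when two GBKs read from the same Source, which is the
fusion case in step 4.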