[
https://issues.apache.org/jira/browse/CRUNCH-34?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Wills updated CRUNCH-34:
-----------------------------
Fix Version/s: 0.4.0
> Refactor the MSCRPlanner logic
> ------------------------------
>
> Key: CRUNCH-34
> URL: https://issues.apache.org/jira/browse/CRUNCH-34
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.3.0
> Reporter: Josh Wills
> Assignee: Josh Wills
> Fix For: 0.4.0
>
> Attachments: PLANNER-REFACTORING.patch
>
>
> I had a conversation with Robert a while back about one of the shoddier areas
> of the Crunch codebase: the planning logic. It relies on a whole bunch of
> mutable state, which makes the logic of the overall planning process
> incomprehensible to anyone except for me (back when I wrote it) and Gabriel
> (who grokked it well enough to fix some bugs in it).
> It turns out that understanding the planning process is actually pretty easy
> if you map the logical plan to a graph that has three kinds of vertices:
> Source, Target, and GroupByKey (GBK). All of the other nodes in the logical
> plan (primarily DoCollection/DoTable instances) make up the edges of the
> graph.
> Once you take this graph perspective, you can think of the MapReduce job
> creation process entirely in terms of graph operations:
> 1) Walk the logical plan and construct the initial Graph object, which allows
> Edges to exist between GBK vertices.
> 2) Build a new graph that is identical to the first one, except it eliminates
> Edges between GBK vertices by constructing additional Source and Target
> vertices.
> 3) Identify all of the (weakly) connected components of the new graph.
> 4) Construct a job out of each connected component: a map-only job when
> there is no GBK vertex in the component, a MapReduce job when there is
> exactly one, or a fusion job when there is more than one.
> I've been working on this off and on for a couple of weeks, and I have a
> version of the planning code that implements the description above and passes
> all of our tests. There are still places where we have mutable state that
> will need to be cleaned up, but I think this is a step in the right
> direction. I'm not sure it's ready for prime time yet, but I wanted to get
> the conversation started.
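
The four planning steps described above can be sketched roughly as follows. This is only an illustration of the graph idea, not the code in PLANNER-REFACTORING.patch or the actual MSCRPlanner: the class PlannerSketch, its Graph, Kind, and JobType types, and every method name are hypothetical. Edges are bare index pairs here, so the DoCollection/DoTable work that a real edge would carry is left out, and the sketch assumes that the Target and Source vertices created for a GBK-to-GBK edge are not connected to each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PlannerSketch {
    enum Kind { SOURCE, TARGET, GBK }
    enum JobType { MAP_ONLY, MAP_REDUCE, FUSED }

    // Hypothetical graph: vertex kinds by index, directed edges as {from, to}.
    static final class Graph {
        final List<Kind> kinds = new ArrayList<>();
        final List<int[]> edges = new ArrayList<>();
        int addVertex(Kind k) { kinds.add(k); return kinds.size() - 1; }
        void addEdge(int from, int to) { edges.add(new int[] {from, to}); }
    }

    // Step 2: copy the graph, replacing every GBK -> GBK edge with
    // GBK -> (new Target) and (new Source) -> GBK. The new Target and
    // Source are deliberately left unconnected (assumption of this sketch).
    static Graph splitGbkEdges(Graph g) {
        Graph out = new Graph();
        out.kinds.addAll(g.kinds);
        for (int[] e : g.edges) {
            boolean gbkToGbk = g.kinds.get(e[0]) == Kind.GBK
                    && g.kinds.get(e[1]) == Kind.GBK;
            if (gbkToGbk) {
                int target = out.addVertex(Kind.TARGET);
                int source = out.addVertex(Kind.SOURCE);
                out.addEdge(e[0], target);
                out.addEdge(source, e[1]);
            } else {
                out.addEdge(e[0], e[1]);
            }
        }
        return out;
    }

    // Union-find lookup with path halving.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Step 3: weakly connected components, ignoring edge direction.
    // Returns a representative vertex index for each vertex.
    static int[] components(Graph g) {
        int[] parent = new int[g.kinds.size()];
        for (int i = 0; i < parent.length; i++) parent[i] = i;
        for (int[] e : g.edges) parent[find(parent, e[0])] = find(parent, e[1]);
        int[] root = new int[parent.length];
        for (int i = 0; i < root.length; i++) root[i] = find(parent, i);
        return root;
    }

    // Step 4: one job per component, classified by its number of GBK vertices.
    // Jobs come back ordered by representative index, which is arbitrary.
    static List<JobType> planJobs(Graph logicalPlan) {
        Graph g = splitGbkEdges(logicalPlan);
        int[] root = components(g);
        Map<Integer, Integer> gbkCount = new TreeMap<>();
        for (int v = 0; v < root.length; v++) {
            gbkCount.merge(root[v], g.kinds.get(v) == Kind.GBK ? 1 : 0, Integer::sum);
        }
        List<JobType> jobs = new ArrayList<>();
        for (int count : gbkCount.values()) {
            jobs.add(count == 0 ? JobType.MAP_ONLY
                    : count == 1 ? JobType.MAP_REDUCE : JobType.FUSED);
        }
        return jobs;
    }

    public static void main(String[] args) {
        // Step 1 (built by hand here): Source -> GBK -> GBK -> Target.
        Graph g = new Graph();
        int src = g.addVertex(Kind.SOURCE);
        int gbk1 = g.addVertex(Kind.GBK);
        int gbk2 = g.addVertex(Kind.GBK);
        int tgt = g.addVertex(Kind.TARGET);
        g.addEdge(src, gbk1);
        g.addEdge(gbk1, gbk2);
        g.addEdge(gbk2, tgt);
        System.out.println(planJobs(g)); // prints [MAP_REDUCE, MAP_REDUCE]
    }
}
```

In this example the edge between the two GBK vertices is split, so the plan falls apart into two components with one GBK each, which gives two MapReduce jobs. A single Source feeding two GBK vertices would instead stay in one component and be classified as a fusion job.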