[ 
https://issues.apache.org/jira/browse/CRUNCH-34?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434039#comment-13434039
 ] 

Robert Chu commented on CRUNCH-34:
----------------------------------

The collection classes I created are based on the current PCollectionImpl 
classes.
                
> Refactor the MSCRPlanner logic
> ------------------------------
>
>                 Key: CRUNCH-34
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-34
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.3.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: PLANNER-REFACTORING.patch
>
>
> I had a conversation with Robert a while back about one of the shoddier areas 
> of the Crunch codebase: the planning logic. It relies on a whole bunch of 
> mutable state, which makes the logic of the overall planning process 
> incomprehensible to anyone except me (back when I wrote it) and Gabriel 
> (who grokked it well enough to fix some bugs in it).
> It turns out that understanding the planning process is actually pretty easy 
> if you map the logical plan to a graph that has three kinds of vertices: 
> Source, Target, and GroupByKey (GBK). All of the other nodes in the logical 
> plan (primarily DoCollection/DoTable instances) make up the edges of the 
> graph.
> Once you take this graph perspective, you can think of the MapReduce job 
> creation process entirely in terms of graph operations:
> 1) Walk the logical plan and construct the initial Graph object, which allows 
> Edges to exist between GBK vertices.
> 2) Build a new graph that is identical to the first one, except it eliminates 
> Edges between GBK vertices by constructing additional Source and Target 
> vertices.
> 3) Identify all of the (weakly) connected components of the new graph.
> 4) Construct one job out of each connected component: a map-only job when 
> there is no GBK node in the component, a MapReduce job when there is exactly 
> one, or a fusion job when there is more than one.
> I've been working on this off and on for a couple of weeks, and I have a 
> version of the planning code that implements the description above and passes 
> all of our tests. There are still places where we have mutable state that 
> will need to be cleaned up, but I think this is a step in the right 
> direction. I'm not sure it's ready for prime time yet, but I wanted to get 
> the conversation started.
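
For readers following along, here is a rough sketch of steps 2 through 4 of the 
quoted description. It is not taken from PLANNER-REFACTORING.patch or from the 
Crunch codebase: PlannerSketch, Vertex, Edge, Kind, splitGbkEdges, components 
and jobType are all hypothetical names, the job labels are plain strings, and 
the DoCollection/DoTable work carried by each edge is left out. It assumes 
Java 16 or later (records).

```java
import java.util.*;

/** Hypothetical sketch only; none of these names come from Crunch. */
class PlannerSketch {
    enum Kind { SOURCE, TARGET, GBK }

    record Vertex(String name, Kind kind) {}
    record Edge(Vertex from, Vertex to) {}

    /** Step 2: replace each GBK -> GBK edge with GBK -> new Target and new Source -> GBK. */
    static List<Edge> splitGbkEdges(List<Edge> edges) {
        List<Edge> out = new ArrayList<>();
        int n = 0;
        for (Edge e : edges) {
            if (e.from().kind() == Kind.GBK && e.to().kind() == Kind.GBK) {
                String tmp = "tmp" + (n++); // stands in for an intermediate output
                out.add(new Edge(e.from(), new Vertex(tmp + "-out", Kind.TARGET)));
                out.add(new Edge(new Vertex(tmp + "-in", Kind.SOURCE), e.to()));
            } else {
                out.add(e);
            }
        }
        return out;
    }

    /** Step 3: weakly connected components, i.e. components with edge direction ignored. */
    static List<Set<Vertex>> components(List<Edge> edges) {
        Map<Vertex, Set<Vertex>> adj = new LinkedHashMap<>();
        for (Edge e : edges) {
            adj.computeIfAbsent(e.from(), k -> new LinkedHashSet<>()).add(e.to());
            adj.computeIfAbsent(e.to(), k -> new LinkedHashSet<>()).add(e.from());
        }
        List<Set<Vertex>> result = new ArrayList<>();
        Set<Vertex> seen = new HashSet<>();
        for (Vertex start : adj.keySet()) {
            if (!seen.add(start)) continue; // already placed in a component
            Set<Vertex> comp = new LinkedHashSet<>();
            Deque<Vertex> stack = new ArrayDeque<>();
            stack.push(start);
            while (!stack.isEmpty()) {
                Vertex v = stack.pop();
                comp.add(v);
                for (Vertex w : adj.get(v)) {
                    if (seen.add(w)) stack.push(w);
                }
            }
            result.add(comp);
        }
        return result;
    }

    /** Step 4: pick a job type from the number of GBK vertices in the component. */
    static String jobType(Set<Vertex> comp) {
        long gbks = comp.stream().filter(v -> v.kind() == Kind.GBK).count();
        return gbks == 0 ? "map-only" : gbks == 1 ? "map-reduce" : "fusion";
    }

    public static void main(String[] args) {
        Vertex src = new Vertex("in", Kind.SOURCE);
        Vertex g1 = new Vertex("gbk1", Kind.GBK);
        Vertex g2 = new Vertex("gbk2", Kind.GBK);
        Vertex out = new Vertex("out", Kind.TARGET);
        // Step 1 is assumed done: a logical plan with one GBK -> GBK edge.
        List<Edge> plan = List.of(new Edge(src, g1), new Edge(g1, g2), new Edge(g2, out));
        for (Set<Vertex> c : components(splitGbkEdges(plan))) {
            System.out.println(jobType(c) + ": " + c);
        }
    }
}
```

In this toy plan the single GBK-to-GBK edge is cut in two, so the chain falls 
apart into two components with one GBK vertex each, which is the case the 
description maps to two ordinary MapReduce jobs.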
