+1-- very cool. :)
On Tue, Jul 1, 2014 at 5:28 AM, Gabriel Reid <[email protected]> wrote:

> Hey Christian,
>
> This looks awesome! There have been a bunch of times when I've been
> digging around in the planner and wanting to have something like this,
> so yes, I definitely think this is useful to have.
>
> - Gabriel
>
>
> On Tue, Jul 1, 2014 at 2:16 PM, Christian Tzolov
> <[email protected]> wrote:
> > Hi,
> >
> > While exploring the Crunch MR execution flow I decided to augment the
> > excellent pipeline DOT diagram with a few additional visualizations of
> > some interesting (for me) internal/intermediate pipeline preparation
> > states, such as the output-pcollection-targets structure (used for the
> > pipeline planning), the graphs before and after the split-up of
> > dependent GBK nodes, and the RTNode hierarchy as persisted in the
> > Configuration before the execution of the pipeline.
> > For each diagram I've plotted some relevant internals like the PType
> > structures. The implementation hack includes 3 additional DotfileWriters
> > hooked inside MSCRPlanner#plan() to intercept the flow.
> >
> > An example of the diagrams generated from the
> > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked
> > below.
> >
> > Do we need such internals visualization? Something like visualization of
> > the logical, mapping and physical (e.g. RTNodes) plans of the pipeline
> > preparation? What do you think?
> >
> > Cheers,
> > Christian
> >
> >
> > Diagrams generated from the
> > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.
> >
> > - Dotfile containing all graphs:
> >   https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot
> >
> > 1. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
> >    - is the existing diagram.
> >      It provides a very well-balanced view of the pipeline, showing how
> >      the functional blocks are mapped into execution Map/Reduce
> >      components and the dependencies between them.
> >
> > 2. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
> >    - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs)
> >      in the MSCRPlanner when the plan() operation is executed:
> >    - Each data flow is depicted with a different color to indicate the
> >      overlapping execution paths.
> >    - The PCollection name, class and PTypes are shown.
> >
> > 3. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
> >    - Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method.
> >      It draws the vertices with their names, pcollection and ptype. The
> >      arc labels list the Graph edges' paths.
> >
> > 4. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
> >    - Graph created in MSCRPlanner#plan() after splitting up the dependent
> >      GBK nodes and breaking the graph up into connected components, each
> >      bounded by a red dashed line.
> >
> > 5. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
> >    - Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer
> >      as well as the Inputs and Outputs.
> >    - RTNodes are deserialized from the Job's
> >      CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped
> >      to the containing Map or Reduce tasks and parent Crunch Job. The
> >      relationship between RTNodes (e.g. parent/children) is depicted with
> >      arrows.
> >    - Named Outputs are deserialized from the CRUNCH_OUTPUTS into
> >      Map<String, OutputConfig> and depicted in the magenta subgraph.
> >    - Inputs are deserialized from the CRUNCH_INPUTS into
> >      Map<FormatBundle, Map<Integer, List<Path>>> and depicted in the
> >      green subgraph.
> >    - The inputs are mapped to the corresponding RTNode using the
> >      nodeIndex reference.
> >    - Outputs are mapped to the corresponding RTNode by the Output Name
> >      references.
> >    - There is no good way to print the anonymous DoFn instances.
> >    - Note: the dependency between the Crunch jobs is not drawn, as it may
> >      require access to the completion hook attributes.
> >    - Note: in order to draw the RTNodes I had to expose their attributes
> >      via public getters.
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
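
For readers unfamiliar with the DOT format mentioned in the thread, here is a minimal sketch of how grouped nodes can be emitted as colored subgraphs (DOT "clusters"), similar in spirit to the green Inputs and magenta Named Outputs subgraphs described above. It does not use any Crunch API: the class name DotSketch, the render() and quote() helpers, and the node names are all hypothetical, and the actual DotfileWriters in the proposed change may be organized quite differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Crunch: builds a DOT digraph as a String.
final class DotSketch {

    // Quote a DOT identifier, escaping backslashes and embedded double quotes.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Emit one "cluster_N" subgraph per map entry, colored via the colors map
    // (default black), and chain the nodes of each cluster with edges.
    static String render(String graphName,
                         Map<String, List<String>> clusters,
                         Map<String, String> colors) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph ").append(quote(graphName)).append(" {\n");
        int index = 0;
        for (Map.Entry<String, List<String>> entry : clusters.entrySet()) {
            String color = colors.getOrDefault(entry.getKey(), "black");
            sb.append("  subgraph cluster_").append(index++).append(" {\n");
            sb.append("    label=").append(quote(entry.getKey())).append(";\n");
            sb.append("    color=").append(color).append(";\n");
            List<String> nodes = entry.getValue();
            for (String node : nodes) {
                sb.append("    ").append(quote(node)).append(";\n");
            }
            for (int i = 0; i + 1 < nodes.size(); i++) {
                sb.append("    ").append(quote(nodes.get(i))).append(" -> ")
                  .append(quote(nodes.get(i + 1))).append(";\n");
            }
            sb.append("  }\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative node names only; they are not taken from a real pipeline.
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        clusters.put("Inputs", List.of("input-0", "map-node"));
        clusters.put("Named Outputs", List.of("reduce-node", "out0"));
        System.out.print(render("sketch", clusters,
                Map.of("Inputs", "green", "Named Outputs", "magenta")));
    }
}
```

Assuming Graphviz is installed, the printed text can be saved to a file and rendered with `dot -Tpng sketch.dot -o sketch.png`.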
