Hi,

While exploring the Crunch MR execution flow, I decided to augment the excellent pipeline DOT diagram with a few additional visualizations of some internal/intermediate pipeline preparation states that I found interesting: the output-pcollection-targets structure (used for the pipeline planning), the graphs before and after the split-up of dependent GBK nodes, and the RTNode hierarchy as persisted in the Configuration before the execution of the pipeline. For each diagram I've plotted some relevant internals, such as the PType structures. The implementation hack includes 3 additional DotfileWriters hooked inside MSCRPlanner#plan() to intercept the flow.

An example of the diagrams generated from the org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked below.

Do we need such a visualization of the internals? Something like a visualization of the logical, mapping and physical (e.g. RTNodes) plans of the pipeline preparation? What do you think?

Cheers,
Christian

Diagrams generated from the org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline:

- Dotfile containing all graphs: https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot

1. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
   - This is the existing diagram. It provides a very well-balanced view of the pipeline, showing how the functional blocks are mapped onto the Map/Reduce execution components and the dependencies between them.

2. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
   - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs) in the MSCRPlanner when plan() is executed.
   - Each data flow is drawn in a different color to show the overlapping execution paths.
   - The PCollection name, class and PTypes are shown.

3. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
   - Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method.
   - It draws the Vertices with their names, pcollection and ptype.
   - The arc labels show the path lists of the Graph's edges.

4. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
   - Visualizes the graph in MSCRPlanner#plan() after it splits up the dependent GBK nodes and breaks the graph up into connected components, each bounded by a red dashed line.

5. https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
   - Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer, as well as the Inputs and Outputs.
   - RTNodes are deserialized from the Job's CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped to the containing Map or Reduce task and the parent Crunch Job. The relationships between RTNodes (e.g. parent/children) are depicted with arrows.
   - Named Outputs are deserialized from CRUNCH_OUTPUTS into Map<String, OutputConfig> and depicted in the magenta subgraph.
   - Inputs are deserialized from CRUNCH_INPUTS into Map<FormatBundle, Map<Integer, List<Path>>> and depicted in the green subgraph.
   - Inputs are mapped to the corresponding RTNode using the nodeIndex reference.
   - Outputs are mapped to the corresponding RTNode by the Output Name references.
   - There is no good way to print the anonymous DoFn instances.
   - Note: the dependency between the Crunch jobs is not drawn, as it may require access to the completion hook attributes.
   - Note: in order to draw the RTNodes I had to expose their attributes via public getters.
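
For anyone who wants a feel for the idea behind diagram 2 without touching Crunch, here is a minimal sketch of rendering a "collection -> set of targets" map as DOT with one color per data flow. It is not the actual hack: the class name OutputTargetsDot, the render and quote helpers, the color list and the use of plain Strings in place of PCollectionImpl and Target are all assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a DOT writer; not part of Crunch. */
final class OutputTargetsDot {

  // Assumed palette, cycled per data flow so overlapping paths stay distinguishable.
  private static final List<String> COLORS =
      Arrays.asList("red", "blue", "darkgreen", "orange");

  /** Wraps a name in double quotes, escaping embedded double quotes. */
  static String quote(String s) {
    return "\"" + s.replace("\"", "\\\"") + "\"";
  }

  /**
   * Renders a map of collection name -> target names as a DOT digraph.
   * Plain Strings stand in for the PCollectionImpl and Target objects.
   */
  static String render(Map<String, Set<String>> outputs) {
    StringBuilder sb = new StringBuilder("digraph outputs {\n");
    int flow = 0;
    for (Map.Entry<String, Set<String>> e : outputs.entrySet()) {
      String color = COLORS.get(flow++ % COLORS.size());
      // One box-shaped node per collection, colored by its flow.
      sb.append("  ").append(quote(e.getKey()))
          .append(" [shape=box, color=").append(color).append("];\n");
      // One edge per target, in the same color as the owning collection.
      for (String target : e.getValue()) {
        sb.append("  ").append(quote(e.getKey())).append(" -> ")
            .append(quote(target)).append(" [color=").append(color).append("];\n");
      }
    }
    return sb.append("}\n").toString();
  }

  public static void main(String[] args) {
    Map<String, Set<String>> outputs = new LinkedHashMap<>();
    outputs.put("pcollection A", new LinkedHashSet<>(Arrays.asList("target 1", "target 2")));
    outputs.put("pcollection B", new LinkedHashSet<>(Arrays.asList("target 2")));
    System.out.print(render(outputs));
  }
}
```

The printed text can be saved to a .dot file and rendered with Graphviz (e.g. dot -Tpng). A target shared by two collections ends up with edges in two colors, which is roughly how overlapping paths become visible.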
