Hi,

On Saturday, 2012-10-27, Gabriel Reid wrote:
> In the few times that I've debugged issues in the planner in Crunch,
> it always takes me a bit of time to figure out (again) how things
> work there. I've been thinking/planning of writing some more inline
> docs and doing a bit of refactoring in the code to help myself (and
> others) with doing this in the future, but something else that I was
> thinking of was the generation of DOT[1] files for pipelines so that
> it's easier to visualize what's going on.

That's a great idea. It will help win over prospective users who
wonder whether Crunch performs as well as a sequence of hand-written
MR jobs. There are other ways to generate graphs in Java, BTW, but in
my experience none of them produces output that matches what
dot/graphviz produces. In my opinion we shouldn't run dot ourselves,
though, because most users don't have dot installed. Just generate the
output and let users call dot themselves.

> I'm sure that functionality like this can be useful (at least to me,
> as I was just using it in a somewhat ad-hoc way to debug
> CRUNCH-102), but I'm not sure if this is something we want to expose
> easily, or keep pretty hidden to just use for debugging. I believe
> Pig provides this same functionality with the "explain" command.
> Any thoughts on adding this, particularly around how we could/should
> expose it in the API?

I think we should make it available to users and make it really easy
to access. I'm not sure about the API, though. Since the dot output is
really cheap to create, we could always generate it, store it inside
the Configuration instance, and provide a static utility class to
access it.

A while ago we discussed moving the debugging/log4j manipulation logic
out of MRPipeline. Perhaps we can use a single CrunchDebug utility for
both.

Regards,
Matthias
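
P.S. To make the Configuration idea a bit more concrete, here is a
rough sketch of what I have in mind. Everything in it is an
assumption: the CrunchDebug method names and the key
"crunch.planner.dotfile" are made up for illustration, and a plain
Map<String, String> stands in for the Hadoop Configuration (which has
the corresponding set(String, String) and get(String) methods) so that
the snippet compiles without Hadoop on the classpath. It is not how
the planner works today.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical utility: none of these methods exist in Crunch.
final class CrunchDebug {
    // Assumed configuration key, invented for this sketch.
    static final String PLAN_DOT_KEY = "crunch.planner.dotfile";

    private CrunchDebug() {}

    // Renders a list of {from, to} edges as DOT text.
    // Node names are assumed not to contain double quotes.
    static String toDot(String graphName, List<String[]> edges) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph \"").append(graphName).append("\" {\n");
        for (String[] edge : edges) {
            sb.append("  \"").append(edge[0]).append("\" -> \"")
              .append(edge[1]).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    // The planner would call this once per run; the Map stands in
    // for the Hadoop Configuration.
    static void storePlanDot(Map<String, String> conf, String dot) {
        conf.put(PLAN_DOT_KEY, dot);
    }

    // Users call this to retrieve the DOT text; returns null if
    // nothing was stored.
    static String getPlanDot(Map<String, String> conf) {
        return conf.get(PLAN_DOT_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        List<String[]> edges = Arrays.asList(
            new String[] {"input", "map"},
            new String[] {"map", "output"});
        storePlanDot(conf, toDot("pipeline", edges));
        // Print the DOT text so it can be redirected to a file.
        System.out.print(getPlanDot(conf));
    }
}
```

Users who have graphviz installed could then redirect the output to a
file and render it themselves, e.g. "dot -Tpng plan.dot -o plan.png".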
