+1
This is something from my Lucene experience: the explain query was quite
helpful when the results were not as expected. The dot information would
likewise be quite handy when you create pipelines.
On 28-10-2012 01:10, Gabriel Reid wrote:
On 27 Oct 2012, at 15:39, Matthias Friedrich wrote:
Hi,
On Saturday, 2012-10-27, Gabriel Reid wrote:
In the few times that I've debugged issues in the planner in Crunch,
it always takes me a bit of time to figure out (again) how things
work there. I've been planning to write some more inline
docs and do a bit of refactoring in the code to help myself (and
others) with doing this in the future, but something else that I was
thinking of was the generation of DOT[1] files for pipelines so that
it's easier to visualize what's going on.
That's a great idea, it will help to win prospective users over who
wonder whether Crunch performs as well as a sequence of hand-written
MR jobs.
There are other ways in Java to generate graphs, BTW, but from my
experience none of them produces output that matches dot/graphviz. In
my opinion we shouldn't run dot ourselves though, because most users
don't have dot installed. We should just generate the output and let
users call dot themselves.
Yes, graphviz/dot also have the advantage of being pretty ubiquitous.
I definitely agree on not running dot ourselves -- the main point to me
for now is just making the information available to anyone who's
interested in it.
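
To make the "generate the output, let users run dot" idea concrete, here
is a rough sketch of emitting DOT text from a graph of pipeline stages.
It is only an illustration: the class name, the method names and the
adjacency-map input are all made up for this example and are not part of
Crunch or its planner.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Crunch.
final class PipelineDotSketch {

    // Renders an adjacency map (stage name -> downstream stage names) as DOT text.
    static String toDot(String graphName, Map<String, List<String>> edges) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph ").append(quote(graphName)).append(" {\n");
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            // Declare every node so stages without outgoing edges still appear.
            sb.append("  ").append(quote(e.getKey())).append(";\n");
            for (String target : e.getValue()) {
                sb.append("  ").append(quote(e.getKey()))
                  .append(" -> ").append(quote(target)).append(";\n");
            }
        }
        sb.append("}\n");
        return sb.toString();
    }

    // Wraps an ID in double quotes and escapes embedded double quotes.
    static String quote(String id) {
        return "\"" + id.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Invented stage names, just to have something to print.
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("read input", Arrays.asList("parallelDo"));
        edges.put("parallelDo", Arrays.asList("groupByKey"));
        edges.put("groupByKey", Arrays.asList("write output"));
        edges.put("write output", Collections.<String>emptyList());
        System.out.print(toDot("pipeline", edges));
    }
}
```

Anyone who has graphviz installed can save the printed text as
pipeline.dot and render it with `dot -Tpng pipeline.dot -o pipeline.png`;
everyone else still gets a readable text description of the graph.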
I'm sure that functionality like this can be useful (at least to me,
as I was just using it in a somewhat ad-hoc way to debug
CRUNCH-102), but I'm not sure if this is something we want to expose
easily, or keep pretty hidden to just use for debugging. I believe
Pig provides this same functionality with the "explain" command.
Any thoughts on adding this, particularly around how we could/should
expose it in the API?
I think we should make it available for users and make it really easy
to access it. I'm not sure about the API, though. Since it's really
cheap to create we could always generate dot output, store it inside
the Configuration instance and provide a static utility class to
access it? A while ago we discussed moving debugging/log4j manipulation
logic out of the MRPipeline; perhaps we can use a single CrunchDebug
utility for both.
I really like the idea of sticking the dot information in the Configuration. In
fact, one of the (several) issues I had before you mentioned that was that
there are actually a few different graphs built up during the planning phase,
and it would be interesting to have access to all of them. Putting them into
a Configuration will resolve that.
I guess we don't need to worry about the API too much for now; if we just
populate the information in the Configuration, we can see how (or if) we
need to make a specific API around it when we get to that point.
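
As a rough illustration of what "populate the information in the
Configuration" might look like with several planner graphs, here is a
sketch. To keep it dependency-free, a plain Map<String, String> stands in
for the Hadoop Configuration (which offers comparable set(name, value)
and get(name) methods). The key prefix, the graph names and the class
name are all invented for the example; nothing here is an existing
Crunch API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a Map stands in for the Hadoop Configuration,
// and the key prefix below is made up.
final class CrunchDebugSketch {

    // Invented prefix; one entry is stored per planner graph.
    static final String KEY_PREFIX = "crunch.planner.dot.";

    // Stores the DOT text of one named graph.
    static void putDot(Map<String, String> conf, String graphName, String dot) {
        conf.put(KEY_PREFIX + graphName, dot);
    }

    // Returns the DOT text of a named graph, or null if it was never stored.
    static String getDot(Map<String, String> conf, String graphName) {
        return conf.get(KEY_PREFIX + graphName);
    }

    // Lists the names of all stored graphs in sorted order.
    static Set<String> graphNames(Map<String, String> conf) {
        Set<String> names = new TreeSet<>();
        for (String key : conf.keySet()) {
            if (key.startsWith(KEY_PREFIX)) {
                names.add(key.substring(KEY_PREFIX.length()));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Invented graph names for two stages of planning.
        putDot(conf, "logical", "digraph \"logical\" {\n}\n");
        putDot(conf, "jobs", "digraph \"jobs\" {\n}\n");
        for (String name : graphNames(conf)) {
            System.out.println(name + ":\n" + getDot(conf, name));
        }
    }
}
```

Keeping one key per graph means each planning stage can be inspected on
its own, and a dedicated API can be layered on top later if it turns out
to be needed.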
- Gabriel