Hey Chao,

Does the asynchronous pipeline execution work in https://issues.apache.org/jira/browse/CRUNCH-156 help with this? Right now, runAsync returns a ListenableFuture<PipelineResult>, but we could add support for returning the graphviz plan as well, so that you could fire up a server to visualize the plan while the job is running.
J

On Tue, Feb 26, 2013 at 8:03 PM, Chao Shi <[email protected]> wrote:

> Yes, it is for debugging and monitoring.
>
> I'm developing a complex pipeline (30+ MRs plus lots of joins). I have a
> hard time understanding which part of the pipeline takes the most running
> time and how much intermediate output it produces. Crunch's optimization
> work is great, but it makes the execution plan difficult to understand.
> Each time I modify the pipeline, I have to dump the dot file and run
> graphviz to generate a new picture and check whether anything is wrong.
>
> About security, I'm not familiar with how Hadoop does it. I will try to
> reuse Hadoop's HttpServer (does it have something to do with security?).
> The bottom line is to make this feature disabled by default, and let users
> enable it at their own risk.
>
> If this feature is enabled, the user can choose to use an unused port or
> a specified port. I don't yet know how the user would learn the randomly
> picked port (via the log?). I will be working on a prototype version first,
> and see if the status page is generally useful.
>
> On Wed, Feb 27, 2013 at 2:30 AM, Matthias Friedrich <[email protected]> wrote:
>
> > Hi Chao,
> >
> > sounds interesting - just a couple of things that come to mind:
> >
> > Is this intended as a debugging aid or for operational monitoring?
> >
> > A Crunch job is a temporary thing; to me this doesn't sound like a
> > good match for a web service because it disappears after a (possibly
> > short) time. Also, when multiple jobs are executed concurrently from
> > the same machine, you can't work with a well-known port, you'd have to
> > pick an unused port for each job.
> >
> > It also looks to me like this has security implications? Right now,
> > Crunch is just a client library and we're part of Hadoop's security
> > framework. A web service we might have to secure in some way.
> >
> > Regards,
> > Matthias
> >
> > On Tuesday, 2013-02-26, Chao Shi wrote:
> > > Hi Crunch Devs,
> > >
> > > I'm interested in adding a web status page to Crunch. I'm working on a
> > > prototype first, which simply runs a Jetty server and renders the dot
> > > file produced by DotFileWriter in the browser. The dot rendering is
> > > done by viz.js <https://github.com/mdaines/viz.js>. It can successfully
> > > render the plan into SVG.
> > >
> > > I think there are 2 issues I hit with viz.js:
> > >
> > > 1. The license of viz.js is unclear. It is compiled from GraphViz source
> > > code with emscripten. GraphViz is licensed under the Eclipse Public
> > > License 1.0.
> > >
> > > 2. viz.js is big and slow. It is a 1.4MB compressed JS. It takes 1 or 2
> > > seconds on my laptop to render my pipeline (30+ MRs). I think it would
> > > be good to have the graph refresh frequently and show the running status
> > > of the pipeline (i.e. whether MRs are done or not). For that, the
> > > rendering time would be too slow.
> > >
> > > Another approach, if viz.js is not possible, is to call the graphviz
> > > command on the server side. I can't find any pure Java implementation
> > > of graphviz.
> > >
> > > Looking forward to your advice.
> > >
> > > Thanks,
> > > Chao
> > >
> >

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
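
For reference, the port question raised in the thread (an unused port per job, and how the user learns which one was picked) can be sketched roughly as below. This is only an illustration under assumptions, not the prototype discussed above and not part of Crunch: the class name PlanStatusServer, the /plan.dot path, and the choice of the JDK's built-in com.sun.net.httpserver.HttpServer (instead of Jetty or Hadoop's HttpServer) are all made up for the example. Passing port 0 lets the OS choose a free port, and binding to 127.0.0.1 keeps the page off the network by default.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a Crunch class): serves a DOT plan string over HTTP. */
final class PlanStatusServer {
    private final HttpServer server;

    private PlanStatusServer(HttpServer server) {
        this.server = server;
    }

    /**
     * Starts a server bound to the loopback interface only.
     * A requestedPort of 0 asks the OS for any unused port, so several
     * jobs started from the same machine do not fight over one port.
     */
    static PlanStatusServer start(String dotPlan, int requestedPort) throws IOException {
        byte[] body = dotPlan.getBytes(StandardCharsets.UTF_8);
        HttpServer http = HttpServer.create(new InetSocketAddress("127.0.0.1", requestedPort), 0);
        // Assumed path; a browser-side renderer could fetch the DOT text from here.
        http.createContext("/plan.dot", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        http.start();
        return new PlanStatusServer(http);
    }

    /** The port actually bound; this is the value one would write to the job log. */
    int port() {
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0);
    }
}
```

A caller would do something like `PlanStatusServer s = PlanStatusServer.start(dot, 0);` and then log `"http://127.0.0.1:" + s.port() + "/plan.dot"`, which is one way to answer the "via log?" question. Securing the endpoint beyond loopback binding is left open here, as it is in the thread.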
