On Tue, Feb 26, 2013 at 9:07 PM, Chao Shi <[email protected]> wrote:

> Josh,
>
> It is exactly what I need. It can help to decouple the status web server
> from the core crunch jar (as it depends on jetty, which is not necessary
> for everyone).
>
> Just to make sure I understand correctly:
>
> runAsync returns a future-like object, e.g. RunningPipeline. A user can
> use it to start a web server.
>
> RunningPipeline runningPipeline = pipeline.runAsync();
> StatusServer statusServer = new StatusServer(runningPipeline, port);
> statusServer.start();
> runningPipeline.waitUntilDone();
> statusServer.stop();
>
> It would also be nice to expose information about each MR stage. If
> this requires careful design of the API, the dot graph is enough for now.
>
Yeah, exactly. I'll make the implementation an interface that extends
ListenableFuture<PipelineResult> so we can add methods to it as appropriate.

J

>
> On Wed, Feb 27, 2013 at 12:29 PM, Josh Wills <[email protected]> wrote:
>
> > Hey Chao,
> >
> > Does the asynchronous pipeline execution work in
> > https://issues.apache.org/jira/browse/CRUNCH-156 help with this? Right
> > now, it returns a ListenableFuture<PipelineResult> from runAsync, but we
> > could add support for returning the graphviz plan as well, so that you
> > could fire up a server to visualize the file while the job was running.
> >
> > J
> >
> >
> > On Tue, Feb 26, 2013 at 8:03 PM, Chao Shi <[email protected]> wrote:
> >
> > > Yes, it is for debugging and monitoring.
> > >
> > > I'm developing a complex pipeline (30+ MRs plus lots of joins). I have
> > > a hard time understanding which part of the pipeline takes the most
> > > running time and how much intermediate output it produces. Crunch's
> > > optimization work is great, but it makes the execution plan difficult
> > > to understand. Each time I modify the pipeline, I have to dump the dot
> > > file and run graphviz to generate a new picture and check whether
> > > anything is wrong.
> > >
> > > About security, I'm not familiar with how Hadoop does it. I will try to
> > > reuse hadoop's HttpServer (does it have something to do with security?).
> > > The bottom line is to make this feature disabled by default, and let
> > > users enable it at their own risk.
> > >
> > > If this feature is enabled, the user can choose to use an unused port
> > > or a specified port. I don't yet have an idea of how the user would
> > > learn the randomly picked port (via the log?). I will work on a
> > > prototype version first, and see if the status page is generally useful.
> > >
> > > On Wed, Feb 27, 2013 at 2:30 AM, Matthias Friedrich <[email protected]>
> > > wrote:
> > >
> > > > Hi Chao,
> > > >
> > > > sounds interesting - just a couple of things that come to mind:
> > > >
> > > > Is this intended as a debugging aid or for operational monitoring?
> > > >
> > > > A Crunch job is a temporary thing, to me this doesn't sound like a
> > > > good match for a web service because it disappears after a (possibly
> > > > short) time. Also, when multiple jobs are executed concurrently from
> > > > the same machine, you can't work with a well-known port, you'd have
> > > > to pick an unused port for each job.
> > > >
> > > > It also looks to me like this has security implications? Right now,
> > > > Crunch is just a client library and we're part of Hadoop's security
> > > > framework. A web service we might have to secure in some way.
> > > >
> > > > Regards,
> > > > Matthias
> > > >
> > > > On Tuesday, 2013-02-26, Chao Shi wrote:
> > > > > Hi Crunch Devs,
> > > > >
> > > > > I'm interested in adding a web status page to crunch. I'm working
> > > > > on a prototype first, which simply runs a jetty server and renders
> > > > > the dot file produced by DotFileWriter in the browser. The dot
> > > > > rendering work is done by viz.js
> > > > > <https://github.com/mdaines/viz.js>. It can successfully render
> > > > > the plan into SVG.
> > > > >
> > > > > I think there are 2 issues I hit with viz.js:
> > > > >
> > > > > 1. The license of viz.js is unclear. It is compiled from GraphViz
> > > > > source code with emscripten. GraphViz is Eclipse Public License 1.0.
> > > > >
> > > > > 2. viz.js is big and slow. It is a 1.4MB compressed JS. It takes 1
> > > > > or 2 seconds on my laptop to render my pipeline (30+ MRs). I think
> > > > > it would be good to have the graph refresh frequently and show the
> > > > > running status of the pipeline (i.e. whether MRs are done or not).
> > > > > Thus the rendering time would be too slow.
> > > > >
> > > > > Another approach is to call the graphviz command on the server
> > > > > side, if viz.js is not possible. I can't find any pure Java
> > > > > implementation of graphviz.
> > > > >
> > > > > Looking forward to your advice.
> > > > >
> > > > > Thanks,
> > > > > Chao

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
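
For reference, the usage pattern discussed in this thread (start a status server next to a running pipeline, stop it when the pipeline is done) can be sketched roughly as follows. This is only an illustration under several assumptions, not Crunch code: StatusServerSketch, its /status and /plan.dot endpoints, and the get helper are hypothetical names; a plain java.util.concurrent.Future stands in for the ListenableFuture<PipelineResult> mentioned above; and the JDK's built-in com.sun.net.httpserver is used in place of jetty or Hadoop's HttpServer. It also shows one possible answer to the open question about randomly picked ports: binding to port 0 lets the OS choose an unused port, and the server can report the chosen port afterwards (for example, to log it).

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names are part of Crunch.
final class StatusServerSketch {
    private final HttpServer server;

    // pipeline: any future-like handle for the running job (assumption).
    // planDot:  the dot text of the execution plan, served as-is.
    // port:     0 asks the OS for any unused port.
    StatusServerSketch(Future<?> pipeline, String planDot, int port) throws IOException {
        // Bind to loopback only, so the page is not exposed on the network.
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/plan.dot", exchange -> reply(exchange, planDot));
        server.createContext("/status",
                exchange -> reply(exchange, pipeline.isDone() ? "DONE" : "RUNNING"));
    }

    private static void reply(HttpExchange exchange, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }

    void start() { server.start(); }

    void stop() { server.stop(0); }

    // The port actually bound, which matters when 0 was requested.
    int port() { return server.getAddress().getPort(); }

    // Small client helper used only to exercise the sketch.
    static String get(int port, String path) throws IOException {
        URI uri = URI.create("http://127.0.0.1:" + port + path);
        try (InputStream in = uri.toURL().openConnection(Proxy.NO_PROXY).getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        CompletableFuture<String> pipeline = new CompletableFuture<>();
        StatusServerSketch status =
                new StatusServerSketch(pipeline, "digraph G { a -> b; }", 0);
        status.start();
        System.out.println("status page on port " + status.port());
        System.out.println(get(status.port(), "/status"));
        pipeline.complete("finished"); // stands in for the job finishing
        System.out.println(get(status.port(), "/status"));
        status.stop();
    }
}
```

Keeping the server in a separate class that only needs a future-like handle and the dot text matches the decoupling goal raised above: the core jar would not need a web server dependency, and the feature stays off unless a user constructs and starts the server explicitly.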
