On Wednesday, 2013-02-27, Chao Shi wrote:
> I'm developing a complex pipeline (30+ MRs plus lots of joins). I have a
> hard time to understand which part of the pipeline spends most running time
> and how much intermediate output does it produce. Crunch's optimization
> work is great, but it makes the execution plan difficult to be understood.
> Each time I modified the pipeline, I have to dump the dot file and run
> graphviz to generate a new picture and examine if there's anything wrong.
>
> About security, I'm not familiar with how Hadoop does it. I will try to
> reuse hadoop's HttpServer (does it have something to do with security?).
> The bottom line is to make this feature disabled by default, and let users
> enable it at their own risk.

OK, sounds good.

> If this feature is enabled, the user can choose to use unused port or
> specified port. I haven't got an idea that how the user know the randomly
> picked port (via log?) . I will be working on a prototype version first,
> and see if the status page is generally useful.

Yeah, logging the URL would probably be the only thing that works. Not
counting fancy stuff like MDNS ;-)

In my opinion, we should try to get this done with the dependencies that we
already get through Hadoop. Each additional library we add to Crunch will
cause interoperability problems for someone.

Regards,
Matthias
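
For illustration, the port handling discussed above (disabled by default, an unused or a specified port, and the URL written to the log) could look roughly like the sketch below. It is only a sketch under stated assumptions: it uses the JDK's built-in com.sun.net.httpserver.HttpServer as a stand-in rather than Hadoop's HttpServer, and the class name StatusPageSketch, the enabled flag, and the placeholder response body are hypothetical, not part of Crunch. Passing port 0 asks the operating system for any free port.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;

/** Hypothetical sketch; not Crunch code and not Hadoop's HttpServer. */
public class StatusPageSketch {
    private static final Logger LOG =
        Logger.getLogger(StatusPageSketch.class.getName());

    /**
     * Starts a status server, or returns null when the feature is disabled.
     * A port of 0 lets the operating system pick any unused port.
     */
    public static HttpServer start(boolean enabled, int port) throws IOException {
        if (!enabled) {
            return null; // disabled by default: the caller must opt in
        }
        // Assumption: bind to the loopback interface only, to limit exposure.
        InetSocketAddress address =
            new InetSocketAddress(InetAddress.getLoopbackAddress(), port);
        HttpServer server = HttpServer.create(address, 0);
        server.createContext("/", exchange -> {
            // Placeholder body; a real page would render the pipeline status.
            byte[] body = "pipeline status placeholder\n"
                .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        // With port 0 the real port is known only after binding, so log the URL.
        LOG.info("Status page available at " + url(server));
        return server;
    }

    /** Builds the URL from the port the server is actually bound to. */
    public static String url(HttpServer server) {
        return "http://localhost:" + server.getAddress().getPort() + "/";
    }
}
```

Calling StatusPageSketch.start(true, 0) would bind to a free port and log the resulting URL, while start(true, 8080) would use the specified port and fail with an exception if it is already taken. Binding to loopback is a choice made for this sketch; a status page for a job running on a remote machine would need a reachable interface, which is where the security question comes back in.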
