On Sun, Sep 16, 2012 at 9:42 AM, Matthias Friedrich <[email protected]> wrote:
> Hi,
>
> On Saturday, 2012-09-15, Josh Wills wrote:
>> On Sat, Sep 15, 2012 at 2:55 AM, Matthias Friedrich <[email protected]> wrote:
> [...]
>
>> I worry about the system becoming overly modular/abstracted. For example,
>> YARN took me awhile to figure out when I was writing Kitten, in no small
>> part b/c there are so many modules to go through before I could figure out
>> how everything hung together. I think that having a ton of different
>> modules to wade through in search of understanding is a barrier to
>> adoption-- at least, to adoption by people like me who like to poke at
>> stuff. I'd want to have some discussion around how deep the rabbit hole
>> goes here.
>
> I see what you mean, and I think there has been a misunderstanding. To
> me, modularization doesn't mean we have to put everything into a
> separate Maven module. Given Java's current limitations that's what is
> often done, but this is just one way of doing it. I know Maven
> multi-module projects are cumbersome to work with (hell, I don't even
> *like* Maven). I would use a Maven module when there are different
> sets of dependencies (like with HBase) or when I need to create
> separate artifacts.
>
> My primary concern is to separate interfaces/abstractions and
> implementation on a package level. This way we can easily exclude
> implementation stuff from the user Javadocs, limiting the conceptional
> surface of Crunch significantly (see [1] for how big the system looks
> from a user perspective). Of course, package-level dependencies
> shouldn't contain cycles, which in most cases is done by making sure
> that abstractions don't depend on implementations. Most people do
> this by instinct, in Crunch it's correct in the vast majority of
> cases. There are just a few misplaced classes and a few times
> implementations and abstractions are mixed inside a single class. This
> makes the dependency graph look like a mess (and Javadoc links would
> point to nirvana), but it's all fixable.

This sounds good to me, but now I think there may indeed have been a misunderstanding here, as I also had the feeling that the goal was to split a lot of things out into different modules. Vinod, can you confirm that your idea on this is in line with what Matthias is talking about here?

I can definitely see the point being made about the huge number of packages in the user Javadoc, even though the number of public packages is much smaller. In general (and based on my experience), the way to make it easy to get started with Crunch is to have a few clear examples on the website/wiki, as these are the usual starting point for most developers. However, once you get past the stage of making things work, I found that it was really easy to miss existing functionality (and reimplement it myself) because the API docs do indeed contain way too much stuff.

Where I'm going with this is that I don't think the current situation gets in the way of getting started with Crunch, but it is detrimental to being really efficient with Crunch once you get past the "getting started" phase.

Back on the topic of splitting things into modules (which appears not to be the focus now, if I understand correctly), I have had experience with projects that went very far with this (GeoTools [1] is a good example), and I found that it made it *really* difficult to get started with those projects, and it definitely scared a lot of people away from them.

>
>> For example, say we added streaming data support, so that we could have
>> pipelines that operated on streams as well as batch input data. Clearly,
>> this will necessitate some API changes to DoFns in order to support things
>> that only make sense in a streaming context, and it's unlikely that there
>> would be any overlap between the lib/* and impl/* functionality that would
>> be applicable to streaming and batch contexts. So would we end up with:
> [...]
>
> Hmm, maybe we should discuss the streaming stuff on a separate thread.
> I'm not sure what you want to achieve (real time stream processing,
> CEP, ...?) or if it really makes sense to implement this as part of
> Crunch at all, but the number of Maven modules looks excessive :)

I'm also not too sure about the streaming support, and I also like the idea of a separate thread. Sounds very interesting, but it also sounds like it's going outside the scope of Crunch (or at least the scope that I see for Crunch).

- Gabriel

[1] http://geotools.org
