Great thread, the kind I wanted to start.

+1 for some sort of release plan. We can either do time-based or feature-based 
release plans, but we should pick one instead of doing both in parallel.

+1 for refactoring done in small steps. While I can understand how it can 
affect ongoing progress, I strongly believe that not doing it will hurt us even 
more.

> I worry about the system becoming overly modular/abstracted. For example,
> YARN took me awhile to figure out when I was writing Kitten, in no small
> part b/c there are so many modules to go through before I could figure out
> how everything hung together. I think that having a ton of different
> modules to wade through in search of understanding is a barrier to
> adoption-- at least, to adoption by people like me who like to poke at
> stuff. I'd want to have some discussion around how deep the rabbit hole
> goes here.


Let me harp on this for a while, especially given that I am responsible for the 
source structure there :) I completely understand your pain, and I've heard the 
same from others too. But I'd argue that the solution isn't to have a monolithic 
code base.

The reason why I started the original discussion of the split is this: I wanted 
to see how I could start writing my own Crunch example - from the POV of a 
Crunch user. I started looking for the APIs and it turned out to be difficult, 
with API and implementation all woven together - just like Hadoop MapReduce if 
you ask me. Sure, you could write more docs (which is a welcome effort BTW), 
but the split gives immediate feedback on what is part of the API and what 
isn't, which methods are meant for users to consume and which are impl details 
that can change anytime, and which APIs are really public and which aren't 
(arguably this is a Java limitation in how package and non-package visibility 
is defined, but we are stuck with it).
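
To make the visibility point concrete, here is a minimal sketch (the package 
and class names are made up for illustration, they are not Crunch's actual 
ones). In a single module, an implementation class that is used from other 
packages must be declared public, because Java has nothing between 
package-private and public, so it looks exactly like the real API to users:

    // Meant to be user-facing API:
    package com.example.crunch.api;

    public interface Pipeline {
      void run();
    }

    // Meant to be an internal detail, but since it is used from other
    // packages it has to be public, so users see it just like the API:
    package com.example.crunch.impl;

    import com.example.crunch.api.Pipeline;

    public class MRJobRunner implements Pipeline {
      public void run() { /* planning and job-submission details */ }
    }

With api and impl split into separate Maven artifacts, a user can depend on 
just the api jar, and the build itself tells them which classes are meant for 
consumption and which can change anytime.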

That said, I agree that we need to hit a sweet spot here - just enough 
modularity across APIs, libs and impl to make things easy for users and to let 
each of them evolve cleanly, but not so much that it becomes intractable for 
developers.

> For example, say we added streaming data support, so that we could have
> pipelines that operated on streams as well as batch input data. Clearly,
> this will necessitate some API changes to DoFns in order to support things
> that only make sense in a streaming context, and it's unlikely that there
> would be any overlap between the lib/* and impl/* functionality that would
> be applicable to streaming and batch contexts. So would we end up with:
> 
> crunch-core-api (shared between batch and stream, e.g., DoFns, MapFns, etc.)
> crunch-batch-api (PCollection and PTable and friends)
> crunch-stream-api (PStream, etc.)
> crunch-batch-impl
> crunch-batch-lib
> crunch-stream-impl
> crunch-stream-lib
> 
> ? And if so, do we want to rename the modules over time to reflect their
> new, more-specific functionality? We go towards crunch-hbase-batch and
> crunch-hbase-streaming and crunch-solr-batch and crunch-solr-streaming, or
> do we have top-level core, batch, and streaming modules w/the
> extension-specific submodules underneath them?


I don't know much about this, but I thought significant parts of the current 
Crunch code base are batch oriented: the APIs, the plan optimizations, etc. Do 
you think that, if we wish to do something streaming oriented, the APIs will 
remain the same?

Irrespective of that, you could have a simpler organization (sketched below):
 - a top-level crunch-batch and crunch-stream
 - and crunch-*/lib, crunch-*/impl under each.
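
Purely to illustrate the naming, that would give a module tree roughly like 
this (just a sketch of the layout, not a worked-out proposal; presumably the 
shared pieces like DoFns would stay in a common core module, as in your list 
above):

    crunch-batch/
        lib/
        impl/
    crunch-stream/
        lib/
        impl/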

Thanks,
+Vinod
