On Fri, Sep 14, 2012 at 7:23 PM, Matthias Friedrich <[email protected]> wrote:
>
> should we discuss the focus of our next release? Maybe make a list
> of things we want to achieve? Or would this be too much process?
>

Sorry for being so slow to reply to this; I'm in a bit of a vacation mode with spotty internet coverage. I'll just post my top priorities here before jumping into the other discussions.

For me, the focus in the short to mid term for Crunch should be to improve usability and stability -- looking at the big picture, I'd say that a kind of mission statement for Crunch could be to eliminate the need to ever write another mapper or reducer, while being the de facto way to work with Hadoop MapReduce. From my own perspective, this goal is already largely achieved, but I think that convincing a larger population of developers will take a bit more work. The main steps that I see for this are:

1. Make it easier to get started with Crunch. This has already been discussed further down in this thread, and there are different ways to achieve it, but I think that the common theme is to improve documentation and API clarity/simplicity.

2. Get rid of the "big" bugs. This is an obvious one, and I'm not sure how many are still lurking in Crunch now, but the incorrect total-order sorting and object-reuse bugs that have been dealt with recently are the kinds of "big" bugs that I'm mostly worried about, as they have major implications and are easy to get burnt by without noticing it.

3. Make Crunch more pluggable, and therefore easier to migrate to -- as in, make it possible to plug in existing Mapper and Reducer implementations, as well as making it easier to plug in existing InputFormats and OutputFormats.

4. Add some handy little things like more clever input and output file handling -- for example, allow giving a glob pattern and a directory as input, with Crunch finding the correct input files by recursively searching the input directory. Another example of this is better handling for output file names (to abstract away the default Hadoop naming).

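To make the input-handling idea in point 4 a bit more concrete, here is a rough sketch of "glob pattern plus directory, searched recursively". It assumes a local filesystem and plain java.nio (Java 8+) rather than any Crunch or Hadoop API; the InputFinder class and its findInputs method are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of the Crunch API.
final class InputFinder {
    private InputFinder() {}

    /**
     * Walks the directory tree under root and returns, in sorted order,
     * every regular file whose file name matches the given glob.
     */
    static List<Path> findInputs(Path root, String glob) throws IOException {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                // Match on the file name alone so the glob applies at any depth.
                .filter(p -> matcher.matches(p.getFileName()))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

A call such as InputFinder.findInputs(Paths.get("/data/logs"), "*.txt") would then pick up matching files in nested subdirectories as well. A real version inside Crunch would presumably go through the Hadoop FileSystem abstraction instead.
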
My reasoning behind all of this is that the easier Crunch is to use, the more it will get used, with all the good things that come with that as the natural result. I'm not sure if these steps comprise a real focus for a next release, or fit more in the category of miscellaneous new additions.

- Gabriel
