My company and I have allocated some time to create a mixed environment of R and other "stuff", in particular Mahout. I am thinking of a "contributed" project in which R is enabled to fill the following roles:

#1 Mahout's front-end driver, mixing Mahout computations and R vectors/matrices;
#2 data vectorization/preparation routines loaded into the backend of Mahout's abstract job and adapted to write DRM;
#3 perhaps some routines allowing subsampling and subsequent visualization of Mahout results for prototyping and control purposes.

#2 comes close to what the R-Hadoop project does with their mapreduce package, but unfortunately that project seems to focus on one particular way of serializing R objects, and adapting it for DRM serialization doesn't seem plausible at this time. Besides, I think it's not that difficult to run R from inside a mapper (R-Hadoop uses streaming, but I think it's worth trying the R inverse java package instead of streaming, bypassing the whole text/parse routine completely).

I think the lack of rapid prototyping and visualization of results is one of the bigger barriers to Mahout adoption. Enabling a mixed environment for CPU-laden computations in R would be a huge leap towards prototyping big data pipelines, IMO. Or at least so it seems from the vantage point of the problems I am currently dealing with. Rapid prototyping of Mahout pipelines may be a huge help, esp. as new methods become available to try and validate.

-d

On Sat, Feb 11, 2012 at 11:01 AM, Jeff Eastman <[email protected]> wrote:
> Now that 0.6 is in the box, it seems a good time to start thinking about
> 0.7, from a high level goal perspective at least. Here are a couple that
> come to mind:
>
> Target code freeze date August 1, 2012
> Get Jenkins working for us again
> Complete clustering refactoring and classification convergence
> ...
