My company and I have set aside some time to build a mixed
environment of R and other "stuff", Mahout in particular. I am
thinking of a "contributed" project with R, where R is enabled to
play the following roles:

#1 a front-end driver for Mahout, mixing Mahout computations with R
vectors/matrices;
#2 data vectorization/preparation routines loaded into the back end of
Mahout's abstract job and adapted to write DRM;
#3 perhaps some routines allowing subsampling and subsequent
visualization of Mahout results, for prototyping and control purposes.


#2 comes fairly close to what the R-Hadoop project does with their
mapreduce package, but unfortunately it looks like that project is
committed to a particular way of serializing R objects, and adapting
it to DRM serialization doesn't seem plausible at this time. Besides,
I don't think it's that difficult to run R from inside the mapper
(R-Hadoop uses streaming, but I think it's worth trying the rJava/JRI
bridge instead of streaming, and bypassing the whole text
serialize/parse routine completely).
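To make the streaming overhead concrete, here is a toy sketch (plain
JDK only; no R, Hadoop, or JRI involved, and the class/method names
are made up for illustration) of the text round-trip that streaming
imposes on every record: each numeric vector gets formatted to text on
one side and re-parsed on the other, which an in-process bridge would
skip entirely by handing the doubles over directly.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class StreamingRoundTrip {

    // Format a vector as a tab-separated line, the way a streaming
    // mapper would emit it to stdout for the R process to read.
    static String toLine(double[] v) {
        return Arrays.stream(v)
                .mapToObj(Double::toString)
                .collect(Collectors.joining("\t"));
    }

    // Re-tokenize and parse the line back into doubles, the way the
    // consuming side has to do for every single record.
    static double[] fromLine(String line) {
        return Arrays.stream(line.split("\t"))
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    public static void main(String[] args) {
        double[] v = {1.5, -2.25, 3.0e-7};
        String line = toLine(v);       // format step (streaming only)
        double[] back = fromLine(line); // parse step (streaming only)
        System.out.println(line);
        System.out.println(Arrays.equals(v, back));
    }
}
```

With an in-process R engine both helper calls disappear; the mapper
would pass its double[] straight across the JNI boundary.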

Rapid prototyping and visualization of results is, I think, one of the
bigger barriers to Mahout adoption. Enabling a mixed environment for
CPU-heavy computations in R would be a huge leap toward prototyping
big-data pipelines, IMO. Or at least it seems that way from the
vantage point of the problems I am currently working on. Rapid
prototyping of Mahout pipelines could be a huge help, especially as
new methods become available to try and validate.

-d

On Sat, Feb 11, 2012 at 11:01 AM, Jeff Eastman
<[email protected]> wrote:
> Now that 0.6 is in the box, it seems a good time to start thinking about
> 0.7, from a high level goal perspective at least. Here are a couple that
> come to mind:
>
> Target code freeze date August 1, 2012
> Get Jenkins working for us again
> Complete clustering refactoring and classification convergence
> ...
