On Tue, Mar 12, 2013 at 3:05 PM, Josh Wills <jwi...@cloudera.com> wrote:

> ....
> First, I wanted to say that I think that there are lots of problems that
> can be handled well in MapReduce (the recent k-means streaming stuff being
> a prime example), even if they could be performed even faster using an
> in-memory model.


Well... yes.  Sort of.  It takes an enormous amount of time to come up with
algorithms like SSVD and streaming k-means.  And then the first thing
somebody wants to do is tweak things in some odd way, such as trimming small
counts or inserting business rules in the middle of PageRank.  These small
changes can easily invalidate the derivation of the single-pass algorithm,
leaving us back with an iterative approach.  The same sort of thing happens
with Monte Carlo methods, where one can derive a closed form, but for a
problem that doesn't quite do what we want.

I am happy to keep pushing on these new kinds of algorithms, especially if,
like streaming k-means, they leave us with a residue that can stand in as a
surrogate for the big data across many different kinds of downstream
computation.
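To make "residue" concrete, here is a toy single-pass sketch builder in the
spirit of streaming k-means.  This is strictly an illustration, not Mahout's
actual StreamingKMeans; the probabilistic split rule and the cutoff growth
schedule below are made up for demonstration.  The point is the output: a
small set of weighted centroids that downstream algorithms can re-cluster,
sample, or fit against instead of the raw data.

// Toy single-pass sketch builder, loosely in the spirit of streaming
// k-means.  Compresses a stream of points into weighted centroids that
// can later stand in for the full data set.  Illustration only.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketch {
  static class WeightedCentroid {
    double[] center;
    double weight;
    WeightedCentroid(double[] p) { center = p.clone(); weight = 1; }
    void absorb(double[] p) {
      weight++;
      for (int i = 0; i < center.length; i++) {
        center[i] += (p[i] - center[i]) / weight;   // running mean
      }
    }
  }

  private final List<WeightedCentroid> sketch = new ArrayList<WeightedCentroid>();
  private final Random rand = new Random();
  private double cutoff = 1.0;   // distance scale; grows as data arrives

  public void add(double[] point) {
    WeightedCentroid nearest = null;
    double best = Double.POSITIVE_INFINITY;
    for (WeightedCentroid c : sketch) {
      double d = distance(point, c.center);
      if (d < best) { best = d; nearest = c; }
    }
    // Far points (probabilistically) open new centroids; near ones merge.
    if (nearest == null || rand.nextDouble() < best / cutoff) {
      sketch.add(new WeightedCentroid(point));
      cutoff *= 1.0 + 1.0 / sketch.size();   // crude growth schedule
    } else {
      nearest.absorb(point);
    }
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
    return Math.sqrt(sum);
  }

  public List<WeightedCentroid> getSketch() { return sketch; }
}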

But we still have the problem at hand.


> The question for me is usually one of resource allocation:
> we're usually sharing our Hadoop clusters with lots of other jobs, and
> building a model as a MapReduce pipeline is usually a good way to play
> nicely w/everyone else and not take up too much memory.


This is a huge issue.  Fungible resources make economies work much better.
'nuff said.

> Of course, there
> are some models where the performance benefit I'll get from fitting the
> model quickly in memory is totally worth the resource costs.
>

Well, I would like to express both.

And more.

I would like to allocate half a terabyte across thousands of nodes for a
2-minute computation.  Or allocate 4000 cores with minimal memory for a
single, blisteringly fast computation.


> Somewhat related, I would really like a seamless way to perform an analysis
> on a small-ish (e.g., a few hundred gigs) dataset in memory and then take
> the same code that I used to perform that analysis and scale it out to run
> over a few terabytes in batch mode.
>

Also very important.  Test small, run big.


> I'm wondering if we could solve both problems by creating a wrapper API
> that would look a lot like the Spark RDD API and then creating
> implementations of that API via Spark/Giraph/Crunch (truly shameless
> promotion: http://crunch.apache.org/ ).


Long-term, this might well be a good idea.

In my view, the proximal question is whether Spark and/or Giraph are ready
for prime time, especially in a shared environment.
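For concreteness, here is roughly the shape I imagine such a wrapper taking.
Everything below is hypothetical (DistCollection, MapFn, and friends are
invented names); the idea is just that a Crunch-, Giraph-, or Spark-backed
class could implement the same interface, so the analysis code itself never
changes between an in-memory run and a batch run.

// Hypothetical sketch of an RDD-flavored wrapper interface with one
// in-memory implementation.  A distributed backend would implement
// DistCollection with the same signatures.
import java.util.ArrayList;
import java.util.List;

interface MapFn<T, U> { U apply(T in); }
interface FilterFn<T> { boolean accept(T in); }
interface CombineFn<T> { T combine(T a, T b); }

interface DistCollection<T> {
  <U> DistCollection<U> map(MapFn<T, U> fn);
  DistCollection<T> filter(FilterFn<T> fn);
  T reduce(CombineFn<T> fn);
}

class InMemoryCollection<T> implements DistCollection<T> {
  private final List<T> data;

  InMemoryCollection(List<T> data) { this.data = data; }

  public <U> DistCollection<U> map(MapFn<T, U> fn) {
    List<U> out = new ArrayList<U>();
    for (T t : data) { out.add(fn.apply(t)); }
    return new InMemoryCollection<U>(out);
  }

  public DistCollection<T> filter(FilterFn<T> fn) {
    List<T> out = new ArrayList<T>();
    for (T t : data) { if (fn.accept(t)) { out.add(t); } }
    return new InMemoryCollection<T>(out);
  }

  public T reduce(CombineFn<T> fn) {
    T acc = data.get(0);
    for (int i = 1; i < data.size(); i++) { acc = fn.combine(acc, data.get(i)); }
    return acc;
  }
}

Test small against InMemoryCollection, then swap in the distributed
implementation and run the identical code over a few terabytes.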


> ...
> I have a set of Crunch + Mahout tools I use for my own data preparation
> tasks (e.g., things like turning a CSV file of numerical and categorical
> variables into a SequenceFile of normalized Mahout Vectors w/indicator
> variables before running things like Mahout's SSVD model), so I think we
> could make a model work where we could get the performance benefits of new
> frameworks w/o having to leave MR1 behind completely.
>

So is this another thing we should discuss?  What about Pig, Crunch, and
similar integration modules for Mahout?  I have had great fun with Pig along
these lines, and I know that lots of folks at Twitter are doing good things
with modeling in Pig.  No reason for Pig to be the only one here, either.
Easier integration would be a huge boon for Mahout.
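To make the quoted data-prep step concrete, here is a back-of-the-envelope
version of it, minus the Crunch plumbing: read CSV rows, z-normalize a
numeric column, expand a categorical column into indicator positions, and
write the results as VectorWritables in a SequenceFile.  The column layout,
the precomputed mean/std, and the category dictionary are all assumptions
for illustration.

// Sketch: CSV -> normalized Mahout Vectors with indicator variables.
// Assumes rows of the form [id, numeric, category] and that summary
// statistics were precomputed on a first pass over the data.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {
  // Hypothetical values, precomputed from the data.
  static final double MEAN = 40.0, STD = 12.0;
  static final List<String> LEVELS = Arrays.asList("red", "green", "blue");

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, VectorWritable.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split(",");              // [id, numeric, category]
      Vector v = new DenseVector(1 + LEVELS.size());
      v.set(0, (Double.parseDouble(cols[1]) - MEAN) / STD);  // z-score
      int level = LEVELS.indexOf(cols[2]);
      if (level >= 0) {
        v.set(1 + level, 1.0);                      // indicator variable
      }
      writer.append(new Text(cols[0]), new VectorWritable(v));
    }
    in.close();
    writer.close();
  }
}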


> I don't have an opinion on the structure of such an effort (via the
> incubator or otherwise), but I thought I would throw the idea out there, as
> it's something I would definitely like to be involved in.
>

Well, let's do this!

I really prefer a Mahout sandbox or module.  Heavy-weight isolation isn't
needed, and I would love it if this could spark a wave of new blood and
activity.
