Tom,

Thanks for your input. I have nothing to argue with, but I think the
project can use the help of people who are kicking the tires, in the
sense that they can bring those problems (in particular, scale
problems) to the list.

> They discover Mahout, which does specifically bill itself as scalable
> (from http://mahout.apache.org, in some of the largest letters: "What
> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
> is to build scalable machine learning libraries.").  They sniff check
> it by massaging some moderately-sized data set into the same format as
> an example from the wiki and they fail to get a result - often because
> their problem has some very different properties (more classes, much
> larger feature space, etc.) and the implementation has some limitation
> that they trip over.

I would go out on a limb and say that no single person knows exactly
the limitations of _all_ currently existing contributions (it's a
community after all, not a vendorized product), and in a few cases I
suspect no proper scale experiment was ever set up (I mean, as in
thousand-node clusters; it's kind of hard to fund that on an ongoing
basis), so only an approximation is known. But a contribution is not
necessarily rejected just because of that. We'll have to work to
gather this information on the wiki. I think the "Mahout in Action"
book, among other things, represents such an attempt to focus on what
is proven and stable and has known limits.

Part of the difficulty of approximating performance is that in a few
cases the run time is super-linear in the input size, and it is hard
to see exactly when Hadoop I/O or GC is going to start acting up.

BTW, if you have concrete experimental data showing limitations of the
methods mentioned on the wiki, please don't hesitate to share it; it
will be received with great appreciation. There are people who are
eager to make improvements when such room for improvement becomes
apparent from benchmarks.

But conducting and submitting benchmarks is the key, IMO. I don't
think there's any way to work the kinks out other than to address them
based on problem reports.

> I'm not sure where this information is best placed - it would be great
> to see it on the Wiki along with the examples, at least.  It would be

I think the wiki is the place.

> awesome to see warnings at runtime ("Warning: You just trained a model
> that you cannot load without at least 20GB of RAM"), but I'm not sure
> how realistic that is.  I would like it to be easier to determine, at
> some very high level, why something didn't work when an experiment
> fails.  Ideally, without having to dive into the code at all.

People who work with MR are accustomed to looking at job counters to
get an estimate of sizes (that's what I do). I see value in creating
custom counters in certain cases and reporting them; that's
reasonable, I guess, and similar to what Pig does. But at this point I
don't see a direct link to annotations for this kind of functionality.
I think that's what an "Improvement" JIRA request is for, on a
case-by-case basis.
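
To make the custom-counter idea concrete, here is a minimal sketch of
what I mean. The class and counter names below are made up for
illustration (they are not part of Mahout); the point is just that a
job can report its own size-related figures through the standard
Hadoop counter mechanism, next to the built-in I/O counters:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical trainer mapper that tracks how many feature tokens it
 * has emitted, so the total shows up in the job counters.
 */
public class CountingTrainerMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  /** Custom counter names; invented for this sketch. */
  public enum ModelStats { FEATURES_SEEN, RECORDS_SKIPPED }

  private final LongWritable one = new LongWritable(1);
  private final Text token = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().trim();
    if (line.isEmpty()) {
      // Surface skipped records in the job counters as well.
      context.getCounter(ModelStats.RECORDS_SKIPPED).increment(1);
      return;
    }
    for (String feature : line.split("\\s+")) {
      // Each emitted feature bumps the custom counter; the aggregated
      // total is visible in the job history / console output.
      context.getCounter(ModelStats.FEATURES_SEEN).increment(1);
      token.set(feature);
      context.write(token, one);
    }
  }
}

With something like that, a user who hits a wall can at least read
the counter totals off the job output and get a rough idea of the
model size without diving into the code.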


Thank you.

-Dmitriy
