Re: consensus statement?

Sebastian Schelter Sun, 18 May 2014 12:19:22 -0700

I think it is important to formulate such a statement and send it out to the "outside world". But we should focus the discussion. I suggest we start with a specific draft that someone prepares (maybe Ted as he started the thread) and then we can discuss and reformulate the individual sentences. I also think the formulation "the committers work on Spark" is not concise enough (and neglects a lot of our goals), but I also don't think it was meant to be part of an official statement in that exact wording.


--sebastian





On 05/18/2014 07:44 PM, Pat Ferrel wrote:

Not sure why you address this to me. I agree with most of your statements.

I think Ted’s intent was to find a simple consensus statement that addresses 
where the project is going in a general way. I look at it as something to 
communicate to the outside world. Why? We are rejecting new mapreduce code. 
This was announced as a project-wide rule and has already been used to reject 
one contribution I know of. OK, what replaces Hadoop mapreduce?  What therefore 
should contributors look to as a model if not Hadoop mapreduce? Do we give no 
advice or comment on this question?

For example, I’m doing drivers that read and write text files. This is quite 
tightly coupled to Spark. Possible contributors should know that this is OK, 
that it will not be rejected and is indeed where most of the engine specific 
work is being done by committers. You are right, most of us know what we are 
doing, but simply to say “no more mapreduce” without offering an alternative 
isn’t quite fair to everyone else.

You are abstracting your code away from a specific engine, and that is great, but in 
practice anyone running it currently must run Spark. This also needs to be 
communicated. It’s as practical as answering, “What do I need to install to make 
Mahout 1.0-snapshot work?”

On May 15, 2014, at 7:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

Pat, it can be as high-level or as detailed as it can be, I don't care,
as long as it doesn't contain misstatements. It simply can state "we adhere
to the "Apache's power of doing" principle and accept new contributions".
This is ok with me. But, as offered, it does try to enumerate strategic
directions, and in doing so, its wording is either vague, or incomplete, or
just wrong.


For example, it says "it is clear that what the committers are working on
is Spark". This is less than accurate.

First, if I interpret it literally, it is wrong, as our committers for the
most part are not working on Spark, and even if they do, to whatever
negligible degree it exists, why would Mahout care?

Second, if it is meant to say "we develop algorithms for Spark", this is
also wrong, because whatever algorithms we have added to date have 0 Spark
dependencies.

Third, if it is meant to say that the majority of what we are working on is
Spark bindings, this is still incorrect. Headcount-wise, Mahout-math
tweaks and Scala enablement were at least as big an effort. Hadoop 2.0 stuff
was at least as big. Documentation and tutorial work was the
absolute leader headcount-wise to date.

The problem i am trying to explain here is that we obviously internally
know what we are doing; but this is for external consumption so we have to
be careful to avoid miscommunication here. It is easy for us to pass on
less than accurate info delivery exactly because we already know what we
are doing and therefore our brain is happy to jump to conclusions and make
up the missing connections between stated and implied as we see it. But for
an outsider, this would sound vague or make him make wrong connections.



On Wed, May 7, 2014 at 9:54 AM, Pat Ferrel <[email protected]> wrote:

This doesn’t seem to be a vision statement. I was +1 to a simple consensus
statement.

The vision is up to you.

We have an interactive shell that scales to huge datasets without
resorting to massive subsampling. One that allows you to deal with the
exact data your black box algos work on. Every data tool has an interactive
mode except Mahout—now it does.  Virtually every complex transform as well
as basic linear algebra works on massive datasets. The interactivity will
allow people to do things with Mahout they could never do before.

We also have the building blocks to make the fastest most flexible cutting
edge collaborative filtering+metadata recommenders in the world. Honestly I
don’t see anything like this elsewhere. We will also be able to fit into
virtually any workflow and directly consume data produced in those systems
with no intermediate scrubbing. This has never happened before in Mahout
and I don’t see it in MLlib either. Even the interactive shell will benefit
from this.

Other feature champions will be able to add to this list.

Seems like the vision comes from feature champions. I may not use Mahout
in the same way you do but I rely on your code. Maybe I serve a different
user type than you. I don’t see a problem with that, do you?

On May 6, 2014, at 2:32 PM, Dmitriy Lyubimov <[email protected]> wrote:

Pat et al.,

The whole problem with original suggested consensus statement is that it
reads as "we are building MLLib for Spark (oh wait, there's already such a
thing)" and then "we are building MLLib for 0xdata" and then perhaps for
something else. Which can't be farther from the true philosophy of what has
been done. If not it, then at best it reads as "we don't know what it is we
are building, but we are including some Spark dependencies now". So it is
either misleading, or sufficiently vague, not sure which is worse.

If a collection of backend-specific separated MLLibs is the new consensus,
i can't say i can share it. In fact, the only motivation for me to do
anything within this project was to fix everything that (per my perhaps
lopsided perception) is less than ideal with the approach of building ML
projects as backend-specific collections of black-box trainers and solvers
and bring in an ideology similar to Julia and R to the jvm-based big data
ML.

If users are to love us, somehow i think it will not be because we ported
yet another flavor of K-means to Spark.

At this point I think it is a little premature to talk about an existing
consensus, it seems.

On Tue, May 6, 2014 at 12:41 PM, Pat Ferrel <[email protected]> wrote:

+1

I personally won’t spend a lot of time generalizing right now.
Contributors can help with that if they want or make suggestions.

On May 6, 2014, at 9:23 AM, Ted Dunning <[email protected]> wrote:

As a bit of commentary, it is clear that what the committers are working on
is Spark


Mahout committers, with very rare exceptions, are not working on Spark.
Spark committers and contributors are working on Spark.
