I completely understand that MLlib has nothing like the completeness of Mahout’s DSL; I know of no other scalable solution that matches it. I don’t know how many times this has to be said. This is something we can all get behind as *unique* to Mahout.

But I stand by the statement that there should also be some lower-level data commonality. There is too much similarity to dismiss it and go completely non-overlapping ways. Even if you can argue for maintaining separate parallel types, let’s at least have some type conversions (I hesitate to say easy-to-use ones); they shouldn’t be all that hard. A conversion of a DRM of o.a.m.Vector rows to an RDD of MLlib Vectors and back would solve my KMeans use case.
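To make that concrete, this is roughly all I have in mind (untested, and assuming the Spark bindings expose the DRM's backing RDD of (key, vector) pairs, something like drm.rdd on a checkpointed DRM; correct me on the actual accessor):

    import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
    import org.apache.spark.mllib.linalg.{Vectors, Vector => MllibVector}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.rdd.RDD
    import scala.collection.JavaConverters._

    // Copy one Mahout row vector into an MLlib sparse vector.
    def toMllib(v: MahoutVector): MllibVector = {
      val nz = v.nonZeroes().asScala
        .map(e => (e.index(), e.get())) // materialize: the iterator may reuse its Element
        .toArray
        .sortBy(_._1)                   // MLlib sparse vectors expect ordered indices
      Vectors.sparse(v.size(), nz.map(_._1), nz.map(_._2))
    }

    // And back, for round-tripping MLlib results (dense, for simplicity).
    def toMahout(v: MllibVector): MahoutVector = new DenseVector(v.toArray)

    // Assuming drmA.rdd is the underlying RDD[(Int, MahoutVector)]:
    // val rows: RDD[MllibVector] = drmA.rdd.map { case (_, v) => toMllib(v) }
    // val model = KMeans.train(rows, k = 50, maxIterations = 10)

Wrap the result back up with drmWrap (or whatever the current entry point is) and the round trip is closed. If conversions like these shipped with the bindings, or better yet as implicits, the either/or choice goes away.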
You know MLlib better than I do, so choose the best level at which to perform type conversions or inheritance splicing. The point is to make the two as seamless as possible. Doesn’t this seem a worthy goal?

On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov <[email protected]> wrote:

Pat, I *just* made a case in this thread explaining that MLlib does not have a single distributed matrix type, and that its own methodologies do not interoperate among themselves for that reason. Therefore, it is fundamentally impossible to be interoperable with MLlib, since nobody can really define what that means in terms of distributed types.

You are in fact referring to their in-core type, not a distributed type. But there's no linear algebra operation support to speak of there either. It is, simply, not algebra at the moment. The types in this hierarchy are just memory storage models, plus private-scope converters to Breeze storage models, but they are not true linalg APIs nor providers of such.

One might conceivably want to standardize on the Breeze APIs, since those are both linalg API and providers, but not the type you've been mentioning. However, that is not a very happy path either. Breeze is a somewhat more interesting substrate to build in-core operations on, but if you read the Spark forum of late, even Spark developers express a whiff of dissatisfaction with it in favor of BIDMat (me too, btw). But while they say BIDMat would be a better choice for in-core operators, they also recognize that they are too invested in the Breeze API by now, and such a move would not be cheap across the board. And that demonstrates another problem with the in-core MLlib architecture: on one side, they don't have a sufficient public in-core DSL or API to speak of; but they also do not have a sufficiently abstract API for in-core BLAS plugins to be truly agnostic of the available in-core methodologies.

So what you are talking about is simply not possible with the current state of things there. But if it were, I'd just suggest you try to port the algebraic things you like in Mahout to MLlib. My guess, however, is that you'd find that porting the algebraic optimizer with a proper level of consistency with in-core operations will not be easy, for reasons including, but not limited to, the ones I just mentioned; although an individual BLAS operation like the matrix square you've mentioned would be fairly easy to do for one of the distributed matrix types in MLlib. But that of course would not be an R-like environment, and not an optimizer.

I like BIDMat a lot, though; but it is not a truly hybrid and self-adjusting environment for in-core operations either (and its DSL is neither R-like nor Matlab-like, so it takes a bit of adjusting to). For that reason even BIDMat's linalg types and DSL are not versatile enough for our (well, my, anyway) purposes, which are to find the best hardware or software subroutine automatically given the current hardware and software platform architecture and the parameters of the requested operation.

On Feb 8, 2015 9:05 AM, "Pat Ferrel" <[email protected]> wrote:
> Why aren’t we using linalg.Vector and its siblings? The same could be asked for linalg.Matrix. If we want to prune dependencies this would help, and it would also significantly increase interoperability.
>
> Case-now: I have a real need to cluster items in a CF-type input matrix. The input matrix A’ has a row per item. I need to drop this into a sequence file and use Mahout’s Hadoop KMeans. Ugh. Or I need to convert A’ into an RDD of linalg.Vectors and use MLlib KMeans. The conversion is not too bad and maybe could be helped with some implicit conversions mahout.Vector <-> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for KMeans).
>
> Case-possible: If we adopted linalg.Vector as the native format, and perhaps even linalg.Matrix, this would give immediate interoperability in some areas, including my specific need. It would significantly pare down dependencies not provided by the environment (mahout-math). It would also support creating distributed computation methods that would work on MLlib and Mahout datasets, addressing Gokhan’s question.
>
> I looked at another “Case-now” possibility, which was to go all-MLlib for item similarity. I found that MLlib doesn’t have a transpose—“transpose, why would you want to do that?” Not even in the multiplied forms A’A, A’B, AA’, all used in item and row similarity. That stopped me from looking deeper.
>
> The strength and unique value of Mahout is the completeness of its generalized linear algebra DSL. But insistence on using Mahout-specific data types is also a barrier for Spark people adopting the DSL. Not having lower-level interoperability is a barrier both ways to mixing Mahout and MLlib—creating unnecessary either/or choices for devs.
>
> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <[email protected]> wrote:
>
>> What I am saying is that for certain algorithms, including both engine-specific stuff (such as aggregation) and DSL stuff, what is the best way of handling them?
>>
>> i) should we add the distributed operations to the Mahout codebase as proposed in #62?
>
> Imo this can't go very well or very far (because of the engine specifics), but I'd be willing to see an experiment with simple things like map and reduce.
>
> The bigger questions are where exactly we'll have to stop (we can't abstract all capabilities out there because of "common denominator" issues), and what percentage of methods it will truly allow to migrate to full backend portability.
>
> And if, after doing all this, we still find ourselves writing engine-specific mixes, why bother? Wouldn't it be better to find a good, easy-to-replicate, incrementally-developed pattern to register and apply engine-specific strategies for every method?
>
>> ii) should we have [engine]-ml modules (like spark-bindings and h2o-bindings) where we can mix the DSL and engine-specific stuff?
>
> This is not quite what I am proposing. Rather, engine-ml modules holding engine-specific _parts_ of algorithms.
>
> However, this really needs a POC over a guinea pig (similarly to how we POC'd algebra in the first place with ssvd and spca).
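P.S. On the "register and apply engine-specific strategies" idea quoted above, something as simple as the registry below is the shape I picture. Purely a hypothetical sketch; none of these names exist in Mahout today:

    import scala.collection.mutable

    // The shared math-scala module would own the contract once...
    trait DistributedOpStrategy[In, Out] {
      def apply(in: In): Out
    }

    // ...plus a registry keyed by (engine, operation).
    object StrategyRegistry {
      private val impls = mutable.Map.empty[(String, String), DistributedOpStrategy[_, _]]

      def register(engine: String, op: String, s: DistributedOpStrategy[_, _]): Unit =
        impls((engine, op)) = s

      def lookup[In, Out](engine: String, op: String): DistributedOpStrategy[In, Out] =
        impls((engine, op)).asInstanceOf[DistributedOpStrategy[In, Out]]
    }

    // Each backend module (spark-bindings, h2o-bindings) registers its
    // engine-specific parts at startup, e.g.
    //   StrategyRegistry.register("spark", "aggregate", new SparkAggregateStrategy)
    // and algorithm code stays engine-agnostic:
    //   StrategyRegistry.lookup[DrmLike[Int], Double]("spark", "aggregate").apply(drmA)

Clunky, but it would keep engine imports out of the shared algorithm code while still allowing the engine-specific _parts_ Dmitriy describes.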
