There are a few things going on with DRM.

First, the Hadoop/MapReduce DRM in Mahout is pretty much constrained to its
persistent format on HDFS (row-wise key/vector pairs).

When we moved to Scala, this notion was expanded: the DRM became one of the
types governed by the R-like DSL and its optimizer of algebraic expressions.
E.g., a distributed ridge regression solution under that DSL, for a dataset
represented by a tall and skinny matrix X, would look something like this
(regularization term omitted for brevity):

val drmX = drmFromHdfs("X")   // distributed row matrix (DRM) loaded from HDFS
val y = ..                    // in-core observation vector

val w = solve(drmX.t %*% drmX, drmX.t %*% y)
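For actual ridge regression, a regularization term is added to the Gram
matrix before solving. A minimal sketch of that variant, assuming the
Samsara imports are in scope and using a hypothetical lambda value (the
in-core update style is illustrative, not the only way to do it):

```scala
// Sketch only: ridge normal equations (X'X + lambda*I) w = X'y.
// Assumes Samsara imports (drm._, scalabindings._, RLikeOps._) in scope.
val lambda = 1.0  // hypothetical regularization strength

// For a tall skinny X, X'X is a small k x k matrix, so collect it
// in-core and add lambda to the diagonal there.
val gram = (drmX.t %*% drmX).collect
for (i <- 0 until gram.nrow) gram(i, i) = gram(i, i) + lambda

// X'y is a single distributed column; collect it as a vector.
val rhs = (drmX.t %*% y).collect(::, 0)

// In-core solve of the regularized system.
val w = solve(gram, rhs)
```

The point of the DSL is that only the distributed products above run on the
cluster; the small k x k solve happens in-core on the driver.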

Next, the algebraic optimizer optimizes the execution plan for a particular
back-end engine, one of which is Spark's RDDs. Mahout RDDs in their checkpoint
format (i.e. a fully-formed intermediate RDD result) have a dual representation
-- either row-wise (tuples of key and row vector) or block-wise (array of
keys -> vertical/horizontal matrix block).
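In Spark-bindings terms, the two checkpoint representations roughly
correspond to the following tuple shapes (a sketch; the alias names follow
Mahout's drm package conventions, but treat the exact signatures as
approximate):

```scala
// Row-wise: one tuple per matrix row.
//   RDD[(K, Vector)]         -- row key -> row vector
type DrmTuple[K] = (K, Vector)

// Block-wise ("blockified"): one tuple per vertical block of rows.
//   RDD[(Array[K], Matrix)]  -- keys of the block's rows -> matrix block
type BlockifiedDrmTuple[K] = (Array[K], Matrix)
```

Operator implementations pick whichever representation suits them, e.g.
block-wise for matrix products, row-wise for persistence.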

Finally, assuming the back-end engine is Spark's RDDs, it is possible to wrap
certain RDD types into the DRM type and, vice versa, to get access to the
checkpointed RDD (e.g. drmX.rdd automatically creates a checkpoint and exports
the matrix data as an RDD).
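A sketch of that round trip, assuming the Spark bindings' drmWrap helper
(names approximate; rowRdd is a hypothetical pre-existing RDD):

```scala
import org.apache.mahout.sparkbindings._

// RDD -> DRM: wrap an existing row-wise RDD of (Int key, Vector) pairs
// so it can participate in DSL expressions.
val drmA = drmWrap(rowRdd)   // rowRdd: RDD[(Int, Vector)], assumed to exist

// DRM -> RDD: triggers a checkpoint and exposes the row-wise RDD,
// so plain Spark code can take over from there.
val backToRdd = drmA.rdd
```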

For further details, I would hope the Mahout/Spark page makes it a bit
clearer. There are also a talk and slides from the last Mahout meetup
discussing the main ideas here.

-d




On Sun, Sep 21, 2014 at 3:34 AM, kalmohsen <kalmoh...@ahlia.edu.bh> wrote:

> I am continuously reading about Mahout, Hadoop, Spark and Scala; willing
> to be able to add value to them. However, I am confused with 2 things:
> Spark RDD and Mahout DRM.
> I do know that spark’s RDD is used while working with Mahout. However, I
> came across some Scala code which is using Mahout DRM or wrapping RDD to
> DRM.
>
> Thus, could anyone clarify the difference between them?
>
> Thanks in advance
> Regards
>
>
>
