Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

Kristina Rogale Plazonic Tue, 25 Aug 2015 08:58:09 -0700

> What about declaring a few simple implicit conversions between the
> MLlib and Breeze Vector classes? if you import them then you should be
> able to write a lot of the source code just as you imagine it, as if
> the Breeze methods were available on the Vector object in MLlib.

The problem is that *I don't know how* to write those implicit defs in
Scala in a good way, and that's why I'm asking the user list for a better
solution. (see below for another hack)

My understanding is that I can define a new class that would extend Vector
and have the implicit def conversion (as in the Scala manual, see below).
Since I got burned by memory issues when using my own classes in this very
way (what's the overhead of creating a new class every time I want to add
two Vectors? I don't know - I'm a lowly data scientist), I'm scared to do
it by myself.
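
On the overhead question, one option that is sometimes used is an implicit value class, which adds operators to an existing type without defining a subclass of Vector. The following is only a minimal sketch, not a Spark API: VectorSyntax, VectorOps and zipWith are hypothetical names, and it assumes spark-mllib is on the classpath. The result is always dense, so sparse inputs are densified through toArray.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper object; none of these names exist in Spark itself.
object VectorSyntax {

  // Element-wise combination of two equally sized vectors into a dense result.
  private def zipWith(a: Vector, b: Vector)(f: (Double, Double) => Double): Vector = {
    require(a.size == b.size, s"size mismatch: ${a.size} vs ${b.size}")
    val x = a.toArray
    val y = b.toArray
    Vectors.dense(Array.tabulate(x.length)(i => f(x(i), y(i))))
  }

  // Extending AnyVal makes this a value class, so the compiler can usually
  // avoid allocating a wrapper object for each operator call.
  implicit class VectorOps(val self: Vector) extends AnyVal {
    def +(other: Vector): Vector = zipWith(self, other)(_ + _)
    def -(other: Vector): Vector = zipWith(self, other)(_ - _)
    def *(a: Double): Vector = Vectors.dense(self.toArray.map(_ * a))
  }
}
```

With import VectorSyntax._ in scope, v1 + v2, v1 - v2 and v * 2.0 should type-check on MLlib vectors. Each operation still allocates arrays for the operands and the result, so this is about convenience rather than speed.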

Since you might have many Spark users with my background (some programming,
but not expert) - making everyone implement their own "addVector" function
might cause many hours of frustration that might be so much better spent on
coding. Adding +, - and scalar * can be done by a Spark contributor in under
one hour (less than I spent just writing these emails), while it would
take me a day (and multiply this by so many users like me), compounded by
uncertainty about how to proceed: Do I use ml instead of mllib because
columns of a DataFrame can be added while mllib Vectors can't? Do I use
Breeze? Do I use apache.commons? Do I write my own (how long will it take
me)? Do I abandon Scala and go with PySpark because I don't have such
problems in numpy?

The slippery slope exists, but if you implement the p-norm of a vector and
sqdist between two vectors, you should implement the simpler operations
too. There is a clear difference between functionality for adding two
vectors and taking a determinant, for example.

If I remember correctly, +, -, *, / were implemented in a previous version of
Spark, in a class that has since been deprecated and expunged from the
documentation.

Many thanks,
Kristina

PS:
Is this what you meant by adding a simple implicit def? Should it be a class
or an object? These are the kinds of questions I grapple with, and why I'm
asking for an example of a solution.

// This is really pseudo-code; I know BreezeVector and SparkVector are
// not real class names.

class MyVector extends SparkVector {

  implicit def toBreeze(v: SparkVector): BreezeVector = BreezeVector(v.toArray)

  implicit def fromBreeze(bv: BreezeVector): SparkVector = Vectors.dense(bv.toArray)

}
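
For comparison, implicit conversions of this kind are usually placed in an object and imported where needed, rather than in a class that extends Vector. The following is a minimal sketch under assumptions: VectorConversions, Example and translate are hypothetical names, Breeze and spark-mllib are assumed to be on the classpath, and every conversion copies the data into a dense array. Writing v1 + v2 directly on two Spark vectors may still not resolve through the conversions alone, so the sketch uses type ascriptions to trigger them.

```scala
import scala.language.implicitConversions
import breeze.linalg.{DenseVector => BDV, Vector => BV}
import org.apache.spark.mllib.linalg.{Vector => SparkVector, Vectors}

// Hypothetical holder for the conversions; import its members to enable them.
object VectorConversions {
  // Spark -> Breeze: copies the values into a dense Breeze vector.
  implicit def toBreeze(v: SparkVector): BV[Double] = BDV(v.toArray)

  // Breeze -> Spark: copies the values into a dense Spark vector.
  implicit def fromBreeze(bv: BV[Double]): SparkVector = Vectors.dense(bv.toArray)
}

object Example {
  import VectorConversions._

  // Shifts a point by an offset, e.g. to move a generated blob to a new center.
  def translate(point: SparkVector, offset: SparkVector): SparkVector = {
    val p: BV[Double] = point      // implicit toBreeze
    val o: BV[Double] = offset     // implicit toBreeze
    val moved: BV[Double] = p + o  // Breeze element-wise addition
    moved                          // implicit fromBreeze
  }
}
```

An object is used because the conversions carry no state; a class would have to be instantiated before its implicit members could be imported.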

On Tue, Aug 25, 2015 at 11:11 AM, Sean Owen <so...@cloudera.com> wrote:

> Yes, you're right that it's quite on purpose to leave this API to
> Breeze, in the main. As you can see the Spark objects have already
> sprouted a few basic operations anyway; there's a slippery slope
> problem here. Why not addition, why not dot products, why not
> determinants, etc.
>
> What about declaring a few simple implicit conversions between the
> MLlib and Breeze Vector classes? if you import them then you should be
> able to write a lot of the source code just as you imagine it, as if
> the Breeze methods were available on the Vector object in MLlib.
>
> On Tue, Aug 25, 2015 at 3:35 PM, Kristina Rogale Plazonic
> <kpl...@gmail.com> wrote:
> > Well, yes, the hack below works (that's all I have time for), but is not
> > satisfactory - it is not safe, is verbose and very cumbersome to use,
> > does not separately deal with the SparseVector case, and is not complete
> > either.
> >
> > My question is, out of hundreds of users on this list, someone must have
> > come up with a better solution - please?
> >
> >
> > import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
> > import org.apache.spark.mllib.linalg.Vectors
> > import org.apache.spark.mllib.linalg.{Vector => SparkVector}
> >
> > // Copies the Spark vector into a Breeze vector (sparse input is densified)
> > def toBreeze(v: SparkVector) = BV(v.toArray)
> >
> > // Copies the Breeze vector back into a dense Spark vector
> > def fromBreeze(bv: BV[Double]) = Vectors.dense(bv.toArray)
> >
> > def add(v1: SparkVector, v2: SparkVector) = fromBreeze(toBreeze(v1) + toBreeze(v2))
> >
> > def subtract(v1: SparkVector, v2: SparkVector) = fromBreeze(toBreeze(v1) - toBreeze(v2))
> >
> > def scalarMultiply(a: Double, v: SparkVector) = fromBreeze(a * toBreeze(v))
> >
> >
> > On Tue, Aug 25, 2015 at 9:41 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
> >>
> >> From what I have understood, you probably need to convert your vector to
> >> Breeze and do your operations there. Check
> >> stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors
> >>
> >> On Aug 25, 2015 7:06 PM, "Kristina Rogale Plazonic" <kpl...@gmail.com>
> >> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I'm still not clear what is the best (or, ANY) way to add/subtract two
> >>> org.apache.spark.mllib.linalg.Vector objects in Scala.
> >>>
> >>> Ok, I understand there was a conscious Spark decision not to support
> >>> linear algebra operations in Scala and leave it to the user to choose a
> >>> linear algebra library.
> >>>
> >>> But, for any newcomer from R or Python, where you don't think twice about
> >>> adding two vectors, it is such a productivity shot in the foot to have to
> >>> write your own + operation. I mean, there is support in Spark for p-norm of
> >>> Vectors, for sqdist between two Vectors, but not for +/-? As I said, I'm a
> >>> newcomer to linear algebra in Scala and am not familiar with Breeze or
> >>> apache.commons - I am willing to learn, but would really benefit from
> >>> guidance from more experienced users. I am also not used to optimizing
> >>> low-level code and am sure that any hack I do will be just horrible.
> >>>
> >>> So, please, could somebody point me to a blog post, documentation, or
> >>> just patches for this really basic functionality? What do you do to get
> >>> around it? Am I the only one to have a problem? (And, would it really be so
> >>> onerous to add +/- to Spark? After all, even the org.apache.spark.sql.Column
> >>> class does have +, -, *, /.)
> >>>
> >>> My stupid little use case is to generate some toy data for KMeans, and I
> >>> need to translate a Gaussian blob to another center (for streaming and
> >>> nonstreaming KMeans both).
> >>>
> >>> Many thanks! (I am REALLY embarrassed to ask such a simple question...)
> >>>
> >>> Kristina
> >
> >
>
