Yeah, I'm totally on board with a pretty Scala DSL on top of some of our
stuff.

In particular, I've been experimenting with wrapping the
DistributedRowMatrix in a Scalding wrapper, so we can do things like

val matrixAsTypedPipe =
  DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))

// e.g. L1 normalize:
matrixAsTypedPipe.map { case (idx, v) => (idx, v.normalize(1)) }
                 .write(new DistributedRowMatrixPipe(outputPath, conf))

// and anything else you would want to do with a scalding
// TypedPipe[(Int, Vector)]
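A self-contained sketch of that L1-normalize step, with plain Scala collections standing in for the scalding TypedPipe and Mahout's Vector (nothing here is the real DistributedRowMatrixPipe API, which is still hypothetical):

```scala
object L1NormalizeSketch {
  type Vec = Seq[Double]

  // Stand-in for Mahout's Vector.normalize(1): divide each element by the
  // L1 norm (sum of absolute values); leave an all-zero vector untouched.
  def normalize1(v: Vec): Vec = {
    val n = v.map(math.abs).sum
    if (n == 0.0) v else v.map(_ / n)
  }

  def main(args: Array[String]): Unit = {
    // (rowIndex, rowVector) pairs, as a TypedPipe[(Int, Vector)] would carry
    val rows: Seq[(Int, Vec)] = Seq(0 -> Seq(1.0, 2.0, 1.0), 1 -> Seq(0.0, 3.0, 3.0))
    val normalized = rows.map { case (idx, v) => (idx, normalize1(v)) }
    normalized.foreach(println)
  }
}
```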

Currently I've been doing this with a package structure directly in Mahout,
in:

   mahout/contrib/scalding

What do people think about having this be something real, after 0.8 goes
out? Are we ready for contrib modules which fold in diverse external
projects in new ways? Integrating directly with Pig and Scalding is a bit
too wide of a tent for Mahout core, but putting these integrations in
entirely new projects is maybe a bit too far away.


On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Dmitriy,
>
> This is very pretty.
>
>
>
>
> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > Ok, so I was fairly easily able to build some DSL for our matrix
> > manipulation (similar to Breeze) in Scala:
> >
> > inline matrix or vector:
> >
> > val  a = dense((1, 2, 3), (3, 4, 5))
> >
> > val b:Vector = (1,2,3)
> >
> > block views and assignments (element/row/vector/block/block of row or
> > vector)
> >
> >
> > a(::, 0)
> > a(1, ::)
> > a(0 to 1, 1 to 2)
> >
> > assignments
> >
> > a(0, ::) := (3, 5, 7)
> > a(0, 0 to 1) := (3, 5)
> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> >
> > operators
> >
> > // hadamard
> > val c = a * b
> >  a *= b
> >
> > // matrix mul
> >  val m = a %*% b
> >
> > and a bunch of other little things like sum, mean, colMeans, etc. That
> > much is easy.
> >
> > Also stuff like the ones found in Breeze, along the lines of
> >
> > val (u,v,s) = svd(a)
> >
> > diag ((1,2,3))
> >
> > and Cholesky in similar ways.
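One way sugar like the above can be layered onto plain types is with Scala operator methods; this is a self-contained toy (not the actual Mahout DSL, and with a Seq-based constructor rather than the tuple-based one shown above) illustrating the Hadamard `*` vs. matrix-multiply `%*%` split:

```scala
case class Dense(rows: Vector[Vector[Double]]) {
  def nrow: Int = rows.length
  def ncol: Int = rows.head.length

  // Hadamard (element-wise) product, spelled * as in the DSL
  def *(b: Dense): Dense =
    Dense(rows.zip(b.rows).map { case (ra, rb) =>
      ra.zip(rb).map { case (x, y) => x * y }
    })

  // Matrix multiplication, spelled %*% so it cannot be confused with *
  def %*%(b: Dense): Dense =
    Dense(Vector.tabulate(nrow, b.ncol) { (i, j) =>
      (0 until ncol).map(k => rows(i)(k) * b.rows(k)(j)).sum
    })
}

object Dense {
  // Rough analogue of the dense((1, 2, 3), (3, 4, 5)) inline constructor
  def dense(rs: Seq[Double]*): Dense = Dense(rs.map(_.toVector).toVector)
}
```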
> >
> > I don't have "inline" initialization for sparse things (yet) simply
> > because I don't need them, but of course all the regular Java
> > constructors and methods are retained; all this is just syntactic sugar
> > in the spirit of DSLs, in hope of making things a bit more readable.
> >
> > My (very little, and very insignificantly opinionated, really)
> > criticism of Breeze in this context is its inconsistency between dense
> > and sparse representations, namely, the lack of consistent overarching
> > trait(s), so that building structure-agnostic solvers like Mahout's
> > Cholesky solver is impossible, as is cross-type matrix use (say, the
> > way I understand it, it is pretty much impossible to multiply a sparse
> > matrix by a dense matrix).
> >
> > I suspect these problems stem from the fact that the authors for
> > whatever reason decided to hardwire dense things to JBlas solvers,
> > whereas I don't believe matrix storage structures must be. But these
> > problems do appear to be serious enough for me to ignore Breeze for
> > now. If I decide to plug in JBlas dense solvers, I guess I will just
> > have them as yet another top-level routine interface taking any
> > Matrix, e.g.
> >
> > val (u,v,s) = svd(m, jblas=true)
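That flag-based routing is easy to sketch with a named default parameter; the Matrix alias and both solver bodies below are stubs that only report which backend would run, not Mahout's or JBlas's actual APIs:

```scala
object SvdFrontEnd {
  type Matrix = Vector[Vector[Double]] // stand-in for Mahout's Matrix

  // Placeholder backends: real code would return the (u, v, s) factors
  private def builtinSvd(m: Matrix): String = "builtin"
  private def jblasSvd(m: Matrix): String = "jblas"

  // One top-level routine taking any Matrix; callers opt into JBlas
  // with svd(m, jblas = true)
  def svd(m: Matrix, jblas: Boolean = false): String =
    if (jblas) jblasSvd(m) else builtinSvd(m)
}
```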
> >
> >
> >
> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > Thank you.
> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> > >
> > >> I think that this contract has migrated a bit from the first starting
> > >> point.
> > >>
> > >> My feeling is that there is a de facto contract now that the matrix
> > >> slice is a single row.
> > >>
> > >> Sent from my iPhone
> > >>
> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dlie...@gmail.com>
> > >> wrote:
> > >>
> > >> > What does Matrix.iterateAll() contractually do? Practically it
> > >> > seems to be row-wise iteration for some implementations, but it
> > >> > doesn't seem to contractually state so in the javadoc. What is
> > >> > MatrixSlice if it is neither a row nor a column? How can I tell
> > >> > what exactly it is I am iterating over?
> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <ted.dunn...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <jake.man...@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >>>> Question #2: which in-core solvers are available for Mahout
> > >> >>>> matrices? I know there's SSVD, probably Cholesky; is there
> > >> >>>> something else? In particular, I need to be solving linear
> > >> >>>> systems; I guess Cholesky should be equipped enough to do just
> > >> >>>> that?
> > >> >>>>
> > >> >>>> Question #3: why did we try to import Colt solvers rather than
> > >> >>>> actually depend on Colt in the first place? Why did we not
> > >> >>>> accept Colt's sparse matrices and create native ones instead?
> > >> >>>>
> > >> >>>> Colt seems to have a notion of sparse in-core matrices too, and
> > >> >>>> seems like a well-rounded solution. However, it doesn't seem to
> > >> >>>> be actively supported, whereas I know Mahout has seen continued
> > >> >>>> enhancements to its in-core matrix support.
> > >> >>>>
> > >> >>>
> > >> >>> Colt was totally abandoned, and I talked to the original author,
> > >> >>> and he blessed its adoption.  When we pulled it in, we found it
> > >> >>> was woefully undertested, and tried our best to hook it in with
> > >> >>> proper tests and use APIs that fit with the use cases we had.
> > >> >>> Plus, we already had the start of some linear APIs (i.e. the
> > >> >>> Vector interface), and dropping the API completely seemed not
> > >> >>> terribly worth it at the time.
> > >> >>>
> > >> >>
> > >> >> There was even more to it than that.
> > >> >>
> > >> >> Colt was under-tested, and there were warts that had to be pulled
> > >> >> out in much of the code.
> > >> >>
> > >> >> But, worse than that, Colt's matrix and vector structure was a
> > >> >> real bugger to extend or change.  It also had all kinds of cruft
> > >> >> where it pretended to support matrices of things, but in fact only
> > >> >> supported matrices of doubles and floats.
> > >> >>
> > >> >> So using Colt as it was (and is, since it is largely abandoned)
> > >> >> was a non-starter.
> > >> >>
> > >> >> As far as in-memory solvers, we have:
> > >> >>
> > >> >> 1) LR decomposition (tested and kinda fast)
> > >> >>
> > >> >> 2) Cholesky decomposition (tested)
> > >> >>
> > >> >> 3) SVD (tested)
> > >> >>
> > >>
> > >
> >
>



-- 

  -jake
