Well, one fundamental step to get there in the Mahout realm, the way I see it, is to create DSLs for Mahout's DRMs in Spark. That's actually one of the other reasons I chose not to follow Breeze. When we unwind Mahout DRMs, we may see sparse or dense slices there with named vectors. Translating that into Breeze blocks would be a problem (and annotation/named-vector treatment is yet another problem, I guess).
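[Editor's aside] To make the point about a unifying trait concrete, here is a minimal, self-contained Scala sketch. The types and names are toy inventions, not Mahout's or Breeze's actual API; real Mahout code would use org.apache.mahout.math.Vector, which dense and sparse vectors both implement.

```scala
// Toy types, hypothetical names -- stand-ins for a shared vector trait.
sealed trait Vec {
  def apply(i: Int): Double
  def size: Int
}

final case class DenseVec(values: Array[Double]) extends Vec {
  def apply(i: Int): Double = values(i)
  def size: Int = values.length
}

final case class SparseVec(size: Int, entries: Map[Int, Double]) extends Vec {
  def apply(i: Int): Double = entries.getOrElse(i, 0.0)
}

// A representation-agnostic operation over DRM-like (rowIndex, row) slices:
// dot every row with a query vector. Because both row types share one trait,
// this code never branches on dense vs. sparse -- the property the thread
// says Breeze's separate dense/sparse hierarchies make hard to get.
def rowDots(rows: Seq[(Int, Vec)], q: Vec): Seq[(Int, Double)] =
  rows.map { case (idx, v) =>
    (idx, (0 until v.size).foldLeft(0.0)((s, i) => s + v(i) * q(i)))
  }
```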
On Mon, Jun 24, 2013 at 2:08 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> You're right on that - so far doubles is all I've needed and all I can
> currently see needing.
>
> I'll take a look at your project and see how easy it is to integrate with
> my Spark ALS and other code - syntax-wise it looks almost the same, so
> swapping out the linear algebra backend would be quite trivial in theory.
>
> So far I have a working implementation of both implicit and explicit ALS
> versions that matches Mahout in RMSE given the same parameters on the 3
> MovieLens data sets. Still some work to do and more testing at scale, plus
> framework stuff. But I'd like to open source this at some point (the Spark
> guys have a few projects upcoming, so I'm also waiting a bit to see what
> happens there, as it may end up duplicating a lot of what they're doing).
>
> —
> Sent from Mailbox for iPhone
>
> On Mon, Jun 24, 2013 at 10:55 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >
> >> That looks great, Dmitriy!
> >>
> >> The thing about Breeze that drives the complexity in it is partly
> >> specialization for Float, Double and Int matrices, and partly getting
> >> the syntax to "just work" for all combinations of matrix types and
> >> operands etc. Mostly it does "just work", but occasionally not.
> >
> > Yes, I noticed that, but since I am wrapping Mahout matrices, there's
> > only a choice of double-filled matrices and vectors. Actually, I would
> > argue that's the way it is supposed to be, in the interest of the KISS
> > principle. I am not sure I see value in "int" matrices for any problem
> > I ever worked on, and skipping on precision to save space is an even
> > more far-fetched notion, as in real life numbers don't take as much
> > space as their pre-vectorized features and annotations. In fact,
> > model training parts and linear algebra are not where the memory
> > bottleneck seems to fatten up at all, in my experience. There's often
> > exponentially growing CPU-bound behavior, yes, but not RAM.
> >
> >> I am surprised that dense * sparse matrix doesn't work, but I guess as
> >> I previously mentioned the sparse matrix support is a bit shaky.
> >
> > This is solely based on eyeballing the trait architecture. I did not
> > actually attempt it. But there's no single unifying trait, for sure.
> >
> >> David Hall is pretty happy to both look into enhancements and help out
> >> with contributions (e.g. I'm hoping to find time to look into a proper
> >> diagonal matrix implementation, and he was very helpful with pointers
> >> etc.), so please do drop things into the Google Group mailing list.
> >> Hopefully wider adoption, especially by this type of community, will
> >> drive Breeze development.
> >>
> >> On another note, I also really like Scalding's matrix API, so
> >> Scala-ish wrappers for Mahout would be cool - another pet project of
> >> mine is a port of that API to Spark too :)
> >>
> >> N
> >>
> >> —
> >> Sent from Mailbox for iPhone
> >>
> >> On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> >>
> >> > Yeah, I'm totally on board with a pretty Scala DSL on top of some of
> >> > our stuff. In particular, I've been experimenting with wrapping the
> >> > DistributedRowMatrix in a Scalding wrapper, so we can do things like:
> >> >
> >> >   val matrixAsTypedPipe =
> >> >     DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))
> >> >
> >> >   // e.g.
> >> >   // L1 normalize:
> >> >   matrixAsTypedPipe.map { case (idx: Int, v: Vector) => (idx, v.normalize(1)) }
> >> >     .write(new DistributedRowMatrixPipe(outputPath, conf))
> >> >
> >> >   // ... and anything else you would want to do with a Scalding
> >> >   // TypedPipe[(Int, Vector)]
> >> >
> >> > Currently I've been doing this with a package structure directly in
> >> > Mahout, in:
> >> >
> >> >   mahout/contrib/scalding
> >> >
> >> > What do people think about having this be something real, after 0.8
> >> > goes out? Are we ready for contrib modules which fold in diverse
> >> > external projects in new ways? Integrating directly with Pig and
> >> > Scalding is a bit too wide of a tent for Mahout core, but putting
> >> > these integrations in entirely new projects is maybe a bit too far
> >> > away.
> >> >
> >> > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> >
> >> >> Dmitriy,
> >> >>
> >> >> This is very pretty.
> >> >>
> >> >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >> >>
> >> >> > Ok, so I was fairly easily able to build a DSL for our matrix
> >> >> > manipulation (similar to Breeze) in Scala.
> >> >> >
> >> >> > Inline matrix or vector:
> >> >> >
> >> >> >   val a = dense((1, 2, 3), (3, 4, 5))
> >> >> >
> >> >> >   val b: Vector = (1, 2, 3)
> >> >> >
> >> >> > Block views and assignments (element/row/vector/block/block of row
> >> >> > or vector):
> >> >> >
> >> >> >   a(::, 0)
> >> >> >   a(1, ::)
> >> >> >   a(0 to 1, 1 to 2)
> >> >> >
> >> >> > Assignments:
> >> >> >
> >> >> >   a(0, ::) := (3, 5, 7)
> >> >> >   a(0, 0 to 1) := (3, 5)
> >> >> >   a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> >> >> >
> >> >> > Operators:
> >> >> >
> >> >> >   // Hadamard (element-wise) product
> >> >> >   val c = a * b
> >> >> >   a *= b
> >> >> >
> >> >> >   // matrix multiplication
> >> >> >   val m = a %*% b
> >> >> >
> >> >> > and a bunch of other little things like sum, mean, colMeans etc.
> >> >> > That much is easy.
> >> >> >
> >> >> > Also stuff like the ones found in Breeze, along the lines of:
> >> >> >
> >> >> >   val (u, v, s) = svd(a)
> >> >> >
> >> >> >   diag((1, 2, 3))
> >> >> >
> >> >> > and Cholesky in similar ways.
> >> >> >
> >> >> > I don't have "inline" initialization for sparse things (yet),
> >> >> > simply because I don't need them, but of course all regular Java
> >> >> > constructors and methods are retained; all this is just syntactic
> >> >> > sugar in the spirit of DSLs, in the hope of making things a bit
> >> >> > more readable.
> >> >> >
> >> >> > My (very little, and very insignificantly opinionated, really)
> >> >> > criticism of Breeze in this context is its inconsistency between
> >> >> > dense and sparse representations, namely the lack of consistent
> >> >> > overarching trait(s), so that building structure-agnostic solvers
> >> >> > like Mahout's Cholesky solver is impossible, as is cross-type
> >> >> > matrix use (say, the way I understand it, it is pretty much
> >> >> > impossible to multiply a sparse matrix by a dense matrix).
> >> >> >
> >> >> > I suspect these problems stem from the fact that the authors for
> >> >> > whatever reason decided to hardwire dense things with JBlas
> >> >> > solvers, whereas I don't believe matrix storage structures must
> >> >> > be. But these problems do appear to be serious enough for me to
> >> >> > ignore Breeze for now. If I decide to plug in JBlas dense solvers,
> >> >> > I guess I will just have them as yet another top-level routine
> >> >> > interface taking any Matrix, e.g.
> >> >> >
> >> >> >   val (u, v, s) = svd(m, jblas = true)
> >> >> >
> >> >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >> >> >
> >> >> > > Thank you.
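[Editor's aside] The operator syntax quoted above (`*`, `%*%`, `dense(...)`) is typically wired up in Scala with implicit wrapper classes over the underlying Java-style matrix class. A toy, self-contained sketch of the technique follows; all names here are hypothetical stand-ins, not the actual Mahout Scala bindings:

```scala
object Dsl {
  // A bare-bones row-major matrix standing in for Mahout's DenseMatrix.
  final class Mat(val rows: Int, val cols: Int) {
    val data: Array[Array[Double]] = Array.ofDim[Double](rows, cols)
  }

  // Hypothetical helper mirroring the dense(...) constructor in the thread.
  def dense(rowVals: Seq[Double]*): Mat = {
    val m = new Mat(rowVals.length, rowVals.head.length)
    for (i <- rowVals.indices; j <- rowVals(i).indices)
      m.data(i)(j) = rowVals(i)(j)
    m
  }

  // The implicit wrapper is what makes `a * b` and `a %*% b` legal syntax
  // without modifying the matrix class itself.
  implicit class MatOps(private val m: Mat) {
    // Hadamard (element-wise) product, spelled * as in the DSL above.
    def *(o: Mat): Mat = {
      val r = new Mat(m.rows, m.cols)
      for (i <- 0 until m.rows; j <- 0 until m.cols)
        r.data(i)(j) = m.data(i)(j) * o.data(i)(j)
      r
    }
    // Matrix multiplication, spelled %*% so it cannot collide with *.
    def %*%(o: Mat): Mat = {
      val r = new Mat(m.rows, o.cols)
      for (i <- 0 until m.rows; j <- 0 until o.cols; k <- 0 until m.cols)
        r.data(i)(j) += m.data(i)(k) * o.data(k)(j)
      r
    }
  }
}
```

A single `import Dsl._` then brings both the constructor and the operator syntax into scope, which is why this style keeps the Java matrix classes untouched.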
> >> >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> >> >> > >
> >> >> > >> I think that this contract has migrated a bit from the first
> >> >> > >> starting point.
> >> >> > >>
> >> >> > >> My feeling is that there is a de facto contract now that the
> >> >> > >> matrix slice is a single row.
> >> >> > >>
> >> >> > >> Sent from my iPhone
> >> >> > >>
> >> >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >> >> > >>
> >> >> > >> > What does Matrix.iterateAll() contractually do? In practice
> >> >> > >> > it seems to be row-wise iteration for some implementations,
> >> >> > >> > but it doesn't seem to contractually state so in the javadoc.
> >> >> > >> > What is a MatrixSlice if it is neither a row nor a column?
> >> >> > >> > How can I tell what exactly it is I am iterating over?
> >> >> > >> >
> >> >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> >> >> > >> >
> >> >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> >> >> > >> >>
> >> >> > >> >>>> Question #2: which in-core solvers are available for
> >> >> > >> >>>> Mahout matrices? I know there's SSVD, probably Cholesky;
> >> >> > >> >>>> is there something else? In particular, I need to be
> >> >> > >> >>>> solving linear systems; I guess Cholesky should be
> >> >> > >> >>>> equipped enough to do just that?
> >> >> > >> >>>>
> >> >> > >> >>>> Question #3: why did we try to import Colt solvers rather
> >> >> > >> >>>> than actually depend on Colt in the first place? Why did
> >> >> > >> >>>> we not accept Colt's sparse matrices and create native
> >> >> > >> >>>> ones instead?
> >> >> > >> >>>> Colt seems to have a notion of sparse in-core matrices
> >> >> > >> >>>> too, and seems like a well-rounded solution. However, it
> >> >> > >> >>>> doesn't seem to be actively supported, whereas I know
> >> >> > >> >>>> Mahout's in-core matrix support has seen continued
> >> >> > >> >>>> enhancements.
> >> >> > >> >>>
> >> >> > >> >>> Colt was totally abandoned, and I talked to the original
> >> >> > >> >>> author and he blessed its adoption. When we pulled it in,
> >> >> > >> >>> we found it was woefully undertested, and tried our best to
> >> >> > >> >>> hook it in with proper tests and use APIs that fit the use
> >> >> > >> >>> cases we had. Plus, we already had the start of some linear
> >> >> > >> >>> APIs (i.e. the Vector interface), and dropping that API
> >> >> > >> >>> completely seemed not terribly worth it at the time.
> >> >> > >> >>
> >> >> > >> >> There was even more to it than that.
> >> >> > >> >>
> >> >> > >> >> Colt was under-tested, and there have been warts that had to
> >> >> > >> >> be pulled out in much of the code.
> >> >> > >> >>
> >> >> > >> >> But, worse than that, Colt's matrix and vector structure was
> >> >> > >> >> a real bugger to extend or change. It also had all kinds of
> >> >> > >> >> cruft where it pretended to support matrices of things, but
> >> >> > >> >> in fact only supported matrices of doubles and floats.
> >> >> > >> >>
> >> >> > >> >> So using Colt as it was (and is, since it is largely
> >> >> > >> >> abandoned) was a non-starter.
> >> >> > >> >> As far as in-memory solvers, we have:
> >> >> > >> >>
> >> >> > >> >> 1) LR decomposition (tested and kinda fast)
> >> >> > >> >>
> >> >> > >> >> 2) Cholesky decomposition (tested)
> >> >> > >> >>
> >> >> > >> >> 3) SVD (tested)
> >> >
> >> > --
> >> > -jake
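[Editor's aside] On the question raised above of whether Cholesky is "equipped enough" to solve linear systems: it is, for symmetric positive-definite A. A minimal, self-contained textbook sketch (not Mahout's CholeskyDecomposition class): factor A = L·Lᵀ, then solve by forward and back substitution.

```scala
object CholeskySketch {
  // Factor a symmetric positive-definite matrix A into lower-triangular L
  // with A = L * L^T. Plain textbook algorithm, no pivoting.
  def cholesky(a: Array[Array[Double]]): Array[Array[Double]] = {
    val n = a.length
    val l = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 to i) {
      val s = (0 until j).map(k => l(i)(k) * l(j)(k)).sum
      l(i)(j) =
        if (i == j) math.sqrt(a(i)(i) - s)
        else (a(i)(j) - s) / l(j)(j)
    }
    l
  }

  // Solve A x = b: forward substitution for L y = b, then back
  // substitution for L^T x = y.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    val l = cholesky(a)
    val y = new Array[Double](n)
    for (i <- 0 until n)
      y(i) = (b(i) - (0 until i).map(k => l(i)(k) * y(k)).sum) / l(i)(i)
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1)
      x(i) = (y(i) - (i + 1 until n).map(k => l(k)(i) * x(k)).sum) / l(i)(i)
    x
  }
}
```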