You're right on that - so far doubles are all I've needed, and all I can 
currently see needing. 


I'll take a look at your project and see how easy it is to integrate with my 
Spark ALS and other code - syntax-wise it looks almost the same, so swapping 
out the linear algebra backend should be quite trivial in theory.


So far I have a working implementation of both the implicit and explicit ALS 
versions that matches Mahout in RMSE given the same parameters on the 3 
MovieLens data sets. There's still some work to do and more testing at scale, 
plus framework stuff, but I'd like to open source this at some point. (The 
Spark guys have a few projects upcoming, though, so I'm also waiting a bit to 
see what happens there, as it may end up duplicating a lot of what they're 
doing.)
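
For context, the heart of the explicit version is a regularized least-squares 
solve per user (and symmetrically per item). A minimal Breeze sketch of that 
per-user update, assuming the weighted-lambda formulation of Zhou et al. - 
just the shape of it, not my actual implementation:

import breeze.linalg._

// Fix the item factors; for each user solve
//   (Yu' Yu + lambda * n_u * I) x_u = Yu' r_u
// where Yu holds the factors of the n_u items the user rated.
def updateUser(Yu: DenseMatrix[Double],   // n_u x k item factors
               ru: DenseVector[Double],   // the user's n_u ratings
               lambda: Double): DenseVector[Double] = {
  val gram = Yu.t * Yu + DenseMatrix.eye[Double](Yu.cols) * (lambda * Yu.rows)
  gram \ (Yu.t * ru)   // SPD system, so a Cholesky solve would do just as well
}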

—
Sent from Mailbox for iPhone

On Mon, Jun 24, 2013 at 10:55 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:

> On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath 
> <nick.pentre...@gmail.com>wrote:
>> That looks great, Dmitriy!
>>
>>
>> The thing that drives the complexity in Breeze is partly the specialization
>> for Float, Double and Int matrices, and partly getting the syntax to "just
>> work" for all combinations of matrix types and operands. Mostly it does
>> "just work", but occasionally not.
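>>
>> For instance, the same expression compiles across element types (a minimal
>> sketch from memory, rather than from the Breeze docs):
>>
>> import breeze.linalg._
>>
>> val md = DenseMatrix((1.0, 2.0), (3.0, 4.0))
>> val mi = DenseMatrix((1, 2), (3, 4))
>> val pd = md * md      // Double matrix product
>> val pi = mi * mi      // identical syntax for Int matrices
>> val col = md(::, 0)   // slicing "just works" too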
> Yes, I noticed that, but since I am wrapping Mahout matrices, there's only a
> choice of double-filled matrices and vectors. Actually, I would argue that's
> the way it is supposed to be, in the interest of the KISS principle. I am not
> sure I see value in "int" matrices for any problem I've ever worked on, and
> skipping on precision to save space is an even more far-fetched notion, as in
> real life the numbers don't take as much space as their pre-vectorized
> features and annotations. In fact, model training and linear algebra are not
> where the memory bottleneck seems to fatten up at all, in my experience.
> There's often exponentially growing CPU-bound behavior, yes, but not RAM.
>>
>>
>> I am surprised that dense * sparse multiplication doesn't work, but I guess,
>> as I previously mentioned, the sparse matrix support is a bit shaky.
>>
> This is solely based on eyeballing the trait architecture; I did not
> actually attempt it. But there's no single unifying trait, for sure.
>>
>>
>> David Hall is pretty happy both to look into enhancements and to help out
>> with contributions (e.g. I'm hoping to find time to look into a proper
>> Diagonal matrix implementation, and he was very helpful with pointers etc.),
>> so please do drop things into the Google Group mailing list. Hopefully wider
>> adoption, especially by this type of community, will drive Breeze
>> development.
>>
>>
>> On another note, I also really like Scalding's matrix API, so Scala-ish
>> wrappers for Mahout would be cool - another pet project of mine is a port
>> of that API to Spark too :)
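>>
>> Something along these lines - hypothetical names, just a sketch of the
>> shape such a wrapper could take:
>>
>> import org.apache.mahout.math.Vector
>> import org.apache.spark.SparkContext._
>> import org.apache.spark.rdd.RDD
>>
>> // A Scalding-flavoured row matrix on Spark: rows keyed by row index.
>> class RowMatrixPipe(val rows: RDD[(Int, Vector)]) {
>>   def mapRows(f: Vector => Vector): RowMatrixPipe =
>>     new RowMatrixPipe(rows.mapValues(f))   // e.g. mapRows(_.normalize(1))
>> }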
>>
>>
>> N
>>
>>
>>
>> —
>> Sent from Mailbox for iPhone
>>
>> On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <jake.man...@gmail.com>
>> wrote:
>>
>> > Yeah, I'm totally on board with a pretty Scala DSL on top of some of our
>> > stuff. In particular, I've been experimenting with wrapping the
>> > DistributedRowMatrix in a Scalding wrapper, so we can do things like:
>> > val matrixAsTypedPipe =
>> >   DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))
>> >
>> > // e.g. L1-normalize every row:
>> > matrixAsTypedPipe
>> >   .map { case (idx, v) => (idx, v.normalize(1)) }
>> >   .write(new DistributedRowMatrixPipe(outputPath, conf))
>> >
>> > // ...and anything else you would want to do with a scalding
>> > // TypedPipe[(Int, Vector)]
>> > Currently I've been doing this with a package structure directly in
>> > Mahout, in:
>> >    mahout/contrib/scalding
>> > What do people think about having this be something real, after 0.8 goes
>> > out?  Are we ready for contrib modules which fold in diverse external
>> > projects in new ways?  Integrating directly with Pig and Scalding is a
>> > bit too wide of a tent for Mahout core, but putting these integrations in
>> > entirely new projects is maybe a bit too far away.
>> > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>> >> Dmitriy,
>> >>
>> >> This is very pretty.
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> >> wrote:
>> >>
>> >> > Ok, so I was fairly easily able to build some DSL for our matrix
>> >> > manipulation (similar to Breeze) in Scala:
>> >> >
>> >> > inline matrix or vector:
>> >> >
>> >> > val a = dense((1, 2, 3), (3, 4, 5))
>> >> >
>> >> > val b: Vector = (1, 2, 3)
>> >> >
>> >> > block views and assignments (element / row / vector / block / block of
>> >> > a row or vector):
>> >> >
>> >> >
>> >> > a(::, 0)
>> >> > a(1, ::)
>> >> > a(0 to 1, 1 to 2)
>> >> >
>> >> > assignments
>> >> >
>> >> > a(0, ::) := (3, 5, 7)
>> >> > a(0, 0 to 1) := (3, 5)
>> >> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
>> >> >
>> >> > operators
>> >> >
>> >> > // hadamard (element-wise) product
>> >> > val c = a * b
>> >> > a *= b
>> >> >
>> >> > // matrix multiplication
>> >> > val m = a %*% b
>> >> >
>> >> > and a bunch of other little things like sum, mean, colMeans, etc. That
>> >> > much is easy.
>> >> >
>> >> > Also stuff like what's found in Breeze, along the lines of:
>> >> >
>> >> > val (u, v, s) = svd(a)
>> >> >
>> >> > diag((1, 2, 3))
>> >> >
>> >> > and Cholesky in similar ways.
>> >> >
>> >> > I don't have "inline" initialization for sparse things (yet), simply
>> >> > because I don't need it, but of course all the regular Java
>> >> > constructors and methods are retained; all of this is just syntactic
>> >> > sugar in the spirit of DSLs, in the hope of making things a bit more
>> >> > readable.
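>> >> >
>> >> > A hypothetical inline sparse initializer - made-up name, it doesn't
>> >> > exist yet - could mirror the dense one:
>> >> >
>> >> > val s = sparse(2, 3)((0, 1) -> 2.0, (1, 2) -> 5.0)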
>> >> >
>> >> > My (very little, and very insignificantly opinionated, really)
>> >> > criticism of Breeze in this context is its inconsistency between dense
>> >> > and sparse representations, namely the lack of consistent overarching
>> >> > trait(s), so that building structure-agnostic solvers like Mahout's
>> >> > Cholesky solver is impossible, as is cross-type matrix use (say, the
>> >> > way I understand it, it is pretty much impossible to multiply a sparse
>> >> > matrix by a dense matrix).
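>> >> >
>> >> > By contrast, with a single overarching Matrix interface the mixed case
>> >> > is a non-issue; a minimal sketch against Mahout's in-core API:
>> >> >
>> >> > import org.apache.mahout.math.{DenseMatrix, Matrix, SparseRowMatrix}
>> >> >
>> >> > val d: Matrix = new DenseMatrix(2, 3)
>> >> > val s: Matrix = new SparseRowMatrix(3, 2)
>> >> > val p: Matrix = d.times(s)   // dense times sparse, one code path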
>> >> >
>> >> > I suspect these problems stem from the fact that the authors, for
>> >> > whatever reason, decided to hardwire the dense types to the jblas
>> >> > solvers, whereas I don't believe matrix storage structures must be
>> >> > tied to solvers. But these problems do appear to be serious enough for
>> >> > me to ignore Breeze for now. If I decide to plug in the jblas dense
>> >> > solvers, I guess I will just have them as yet another top-level
>> >> > routine interface taking any Matrix, e.g.
>> >> >
>> >> > val (u, v, s) = svd(m, jblas = true)
>> >> >
>> >> >
>> >> >
>> >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > Thank you.
>> >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <ted.dunn...@gmail.com>
>> wrote:
>> >> > >
>> >> > >> I think that this contract has migrated a bit from the original
>> >> > >> starting point.
>> >> > >>
>> >> > >> My feeling is that there is a de facto contract now that a matrix
>> >> > >> slice is a single row.
>> >> > >>
>> >> > >> Sent from my iPhone
>> >> > >>
>> >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dlie...@gmail.com>
>> >> wrote:
>> >> > >>
>> >> > >> > What does Matrix.iterateAll() contractually do? In practice it
>> >> > >> > seems to be row-wise iteration for some implementations, but the
>> >> > >> > javadoc doesn't contractually state so. And what is a MatrixSlice
>> >> > >> > if it is neither a row nor a column? How can I tell what exactly
>> >> > >> > it is I am iterating over?
>> >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <ted.dunn...@gmail.com>
>> >> > wrote:
>> >> > >> >
>> >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
>> >> jake.man...@gmail.com>
>> >> > >> >> wrote:
>> >> > >> >>
>> >> > >> >>>> Question #2: which in-core solvers are available for Mahout
>> >> > >> >>>> matrices? I know there's SSVD, and probably Cholesky; is there
>> >> > >> >>>> something else? In particular, I need to be solving linear
>> >> > >> >>>> systems; I guess Cholesky should be equipped enough to do just
>> >> > >> >>>> that?
>> >> > >> >>>>
>> >> > >> >>>> Question #3: why did we try to import the Colt solvers rather
>> >> > >> >>>> than actually depend on Colt in the first place? And why did we
>> >> > >> >>>> not accept Colt's sparse matrices instead of creating native
>> >> > >> >>>> ones?
>> >> > >> >>>>
>> >> > >> >>>> Colt seems to have a notion of sparse in-core matrices too,
>> >> > >> >>>> and seems like a well-rounded solution. However, it doesn't
>> >> > >> >>>> seem to be actively supported, whereas I know Mahout's in-core
>> >> > >> >>>> matrix support has seen continued enhancements.
>> >> > >> >>>>
>> >> > >> >>>
>> >> > >> >>> Colt was totally abandoned, and I talked to the original
>> >> > >> >>> author, who blessed its adoption.  When we pulled it in, we
>> >> > >> >>> found it was woefully undertested, and we tried our best to
>> >> > >> >>> hook it in with proper tests and to use APIs that fit the use
>> >> > >> >>> cases we had.  Plus, we already had the start of some linear
>> >> > >> >>> algebra APIs (i.e. the Vector interface), and dropping that API
>> >> > >> >>> completely seemed not terribly worth it at the time.
>> >> > >> >>>
>> >> > >> >>
>> >> > >> >> There was even more to it than that.
>> >> > >> >>
>> >> > >> >> Colt was under-tested, and there have been warts that had to be
>> >> > >> >> pulled out in much of the code.
>> >> > >> >>
>> >> > >> >> But, worse than that, Colt's matrix and vector structure was a
>> >> > >> >> real bugger to extend or change.  It also had all kinds of cruft
>> >> > >> >> where it pretended to support matrices of arbitrary things but in
>> >> > >> >> fact only supported matrices of doubles and floats.
>> >> > >> >>
>> >> > >> >> So using Colt as it was (and is, since it is largely abandoned)
>> >> > >> >> was a non-starter.
>> >> > >> >>
>> >> > >> >> As far as in-memory solvers go, we have:
>> >> > >> >>
>> >> > >> >> 1) LR decomposition (tested and kinda fast)
>> >> > >> >>
>> >> > >> >> 2) Cholesky decomposition (tested)
>> >> > >> >>
>> >> > >> >> 3) SVD (tested)
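>> >> > >> >>
>> >> > >> >> For instance, solving a linear system in-core with the QR
>> >> > >> >> decomposition (a minimal sketch, assuming
>> >> > >> >> org.apache.mahout.math.QRDecomposition and its solve method):
>> >> > >> >>
>> >> > >> >> import org.apache.mahout.math.{DenseMatrix, QRDecomposition}
>> >> > >> >>
>> >> > >> >> val a = new DenseMatrix(Array(Array(4.0, 2.0), Array(2.0, 3.0)))
>> >> > >> >> val b = new DenseMatrix(Array(Array(1.0), Array(2.0)))
>> >> > >> >> val x = new QRDecomposition(a).solve(b)   // 2x1 solution matrix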
>> >> > >> >>
>> >> > >>
>> >> > >
>> >> >
>> >>
>> > --
>> >   -jake
>>
