Well, one fundamental step to get there in the Mahout realm, the way I see it, is to create DSLs for Mahout's DRMs in Spark. That's actually one of the other reasons I chose not to follow Breeze. When we unwind Mahout DRMs, we may see sparse or dense slices there with named vectors. Translating that into Breeze blocks would be a problem (and annotation/named-vector treatment is yet another problem, I guess).
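[Editor's aside] To make the point about a unifying trait concrete, here is a minimal, self-contained Scala sketch. The types and names are toy inventions, not Mahout's or Breeze's actual API; real Mahout code would use org.apache.mahout.math.Vector, which dense and sparse vectors both implement.

```scala
// Toy types, hypothetical names -- stand-ins for a shared vector trait.
sealed trait Vec {
  def apply(i: Int): Double
  def size: Int
}

final case class DenseVec(values: Array[Double]) extends Vec {
  def apply(i: Int): Double = values(i)
  def size: Int = values.length
}

final case class SparseVec(size: Int, entries: Map[Int, Double]) extends Vec {
  def apply(i: Int): Double = entries.getOrElse(i, 0.0)
}

// A representation-agnostic operation over DRM-like (rowIndex, row) slices:
// dot every row with a query vector. Because both row types share one trait,
// this code never branches on dense vs. sparse -- the property the thread
// says Breeze's separate dense/sparse hierarchies make hard to get.
def rowDots(rows: Seq[(Int, Vec)], q: Vec): Seq[(Int, Double)] =
  rows.map { case (idx, v) =>
    (idx, (0 until v.size).foldLeft(0.0)((s, i) => s + v(i) * q(i)))
  }
```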
On Mon, Jun 24, 2013 at 2:08 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> You're right on that - so far doubles is all I've needed and all I can
> currently see needing.
>
> I'll take a look at your project and see how easy it is to integrate with
> my Spark ALS and other code - syntax-wise it looks almost the same, so
> swapping out the linear algebra backend would be quite trivial in theory.
>
> So far I have a working implementation of both implicit and explicit ALS
> versions that matches Mahout in RMSE given the same parameters on the 3
> MovieLens data sets. Still some work to do and more testing at scale, plus
> framework stuff. But I'd like to open source this at some point (the Spark
> guys have a few projects upcoming, so I'm also waiting a bit to see what
> happens there, as it may end up duplicating a lot of what they're doing).
>
> —
> Sent from Mailbox for iPhone
>
> On Mon, Jun 24, 2013 at 10:55 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >
> >> That looks great, Dmitriy!
> >>
> >> The thing about Breeze that drives the complexity in it is partly
> >> specialization for Float, Double and Int matrices, and partly getting
> >> the syntax to "just work" for all combinations of matrix types and
> >> operands etc. Mostly it does "just work", but occasionally not.
> >
> > Yes, I noticed that, but since I am wrapping Mahout matrices, there's
> > only a choice of double-filled matrices and vectors. Actually, I would
> > argue that's the way it is supposed to be, in the interest of the KISS
> > principle. I am not sure I see value in "int" matrices for any problem
> > I ever worked on, and skipping on precision to save space is an even
> > more far-fetched notion, as in real life numbers don't take as much
> > space as their pre-vectorized features and annotations. In fact,
> > model training parts and linear algebra are not where the memory
> > bottleneck seems to fatten up at all, in my experience. There's often
> > exponentially growing CPU-bound behavior, yes, but not RAM.
> >
> >> I am surprised that dense * sparse matrix doesn't work, but I guess as
> >> I previously mentioned the sparse matrix support is a bit shaky.
> >
> > This is solely based on eyeballing the trait architecture. I did not
> > actually attempt it. But there's no single unifying trait, for sure.
> >
> >> David Hall is pretty happy to both look into enhancements and help out
> >> with contributions (e.g. I'm hoping to find time to look into a proper
> >> diagonal matrix implementation, and he was very helpful with pointers
> >> etc.), so please do drop things into the Google Group mailing list.
> >> Hopefully wider adoption, especially by this type of community, will
> >> drive Breeze development.
> >>
> >> On another note, I also really like Scalding's matrix API, so
> >> Scala-ish wrappers for Mahout would be cool - another pet project of
> >> mine is a port of that API to Spark too :)
> >>
> >> N
> >>
> >> —
> >> Sent from Mailbox for iPhone
> >>
> >> On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> >>
> >> > Yeah, I'm totally on board with a pretty Scala DSL on top of some of
> >> > our stuff. In particular, I've been experimenting with wrapping the
> >> > DistributedRowMatrix in a Scalding wrapper, so we can do things like:
> >> >
> >> >   val matrixAsTypedPipe =
> >> >     DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))
> >> >
> >> >   // e.g.
> >> >   // L1 normalize:
> >> >   matrixAsTypedPipe.map { case (idx: Int, v: Vector) => (idx, v.normalize(1)) }
> >> >     .write(new DistributedRowMatrixPipe(outputPath, conf))
> >> >
> >> >   // ... and anything else you would want to do with a Scalding
> >> >   // TypedPipe[(Int, Vector)]
> >> >
> >> > Currently I've been doing this with a package structure directly in
> >> > Mahout, in:
> >> >
> >> >   mahout/contrib/scalding
> >> >
> >> > What do people think about having this be something real, after 0.8
> >> > goes out? Are we ready for contrib modules which fold in diverse
> >> > external projects in new ways? Integrating directly with Pig and
> >> > Scalding is a bit too wide of a tent for Mahout core, but putting
> >> > these integrations in entirely new projects is maybe a bit too far
> >> > away.
> >> >
> >> > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> >
> >> >> Dmitriy,
> >> >>
> >> >> This is very pretty.
> >> >>
> >> >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >> >>
> >> >> > Ok, so I was fairly easily able to build a DSL for our matrix
> >> >> > manipulation (similar to Breeze) in Scala.
> >> >> >
> >> >> > Inline matrix or vector:
> >> >> >
> >> >> >   val a = dense((1, 2, 3), (3, 4, 5))
> >> >> >
> >> >> >   val b: Vector = (1, 2, 3)
> >> >> >
> >> >> > Block views and assignments (element/row/vector/block/block of row
> >> >> > or vector):
> >> >> >
> >> >> >   a(::, 0)
> >> >> >   a(1, ::)
> >> >> >   a(0 to 1, 1 to 2)
> >> >> >
> >> >> > Assignments:
> >> >> >
> >> >> >   a(0, ::) := (3, 5, 7)
> >> >> >   a(0, 0 to 1) := (3, 5)
> >> >> >   a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> >> >> >
> >> >> > Operators:
> >> >> >
> >> >> >   // Hadamard (element-wise) product
> >> >> >   val c = a * b
> >> >> >   a *= b
> >> >> >
> >> >> >   // matrix multiplication
> >> >> >   val m = a %*% b
> >> >> >
> >> >> > and a bunch of other little things like sum, mean, colMeans etc.
> >> >> > That much is easy.
> >> >> >
> >> >> > Also stuff like the ones found in Breeze, along the lines of:
> >> >> >
> >> >> >   val (u, v, s) = svd(a)
> >> >> >
> >> >> >   diag((1, 2, 3))
> >> >> >
> >> >> > and Cholesky in similar ways.
> >> >> >
> >> >> > I don't have "inline" initialization for sparse things (yet),
> >> >> > simply because I don't need them, but of course all regular Java
> >> >> > constructors and methods are retained; all this is just syntactic
> >> >> > sugar in the spirit of DSLs, in the hope of making things a bit
> >> >> > more readable.
> >> >> >
> >> >> > My (very little, and very insignificantly opinionated, really)
> >> >> > criticism of Breeze in this context is its inconsistency between
> >> >> > dense and sparse representations, namely the lack of consistent
> >> >> > overarching trait(s), so that building structure-agnostic solvers
> >> >> > like Mahout's Cholesky solver is impossible, as is cross-type
> >> >> > matrix use (say, the way I understand it, it is pretty much
> >> >> > impossible to multiply a sparse matrix by a dense matrix).
> >> >> >
> >> >> > I suspect these problems stem from the fact that the authors for
> >> >> > whatever reason decided to hardwire dense things with JBlas
> >> >> > solvers, whereas I don't believe matrix storage structures must
> >> >> > be. But these problems do appear to be serious enough for me to
> >> >> > ignore Breeze for now. If I decide to plug in JBlas dense solvers,
> >> >> > I guess I will just have them as yet another top-level routine
> >> >> > interface taking any Matrix, e.g.
> >> >> >
> >> >> >   val (u, v, s) = svd(m, jblas = true)
> >> >> >
> >> >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >> >> >
> >> >> > > Thank you.
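[Editor's aside] The operator syntax quoted above (`*`, `%*%`, `dense(...)`) is typically wired up in Scala with implicit wrapper classes over the underlying Java-style matrix class. A toy, self-contained sketch of the technique follows; all names here are hypothetical stand-ins, not the actual Mahout Scala bindings:

```scala
object Dsl {
  // A bare-bones row-major matrix standing in for Mahout's DenseMatrix.
  final class Mat(val rows: Int, val cols: Int) {
    val data: Array[Array[Double]] = Array.ofDim[Double](rows, cols)
  }

  // Hypothetical helper mirroring the dense(...) constructor in the thread.
  def dense(rowVals: Seq[Double]*): Mat = {
    val m = new Mat(rowVals.length, rowVals.head.length)
    for (i <- rowVals.indices; j <- rowVals(i).indices)
      m.data(i)(j) = rowVals(i)(j)
    m
  }

  // The implicit wrapper is what makes `a * b` and `a %*% b` legal syntax
  // without modifying the matrix class itself.
  implicit class MatOps(private val m: Mat) {
    // Hadamard (element-wise) product, spelled * as in the DSL above.
    def *(o: Mat): Mat = {
      val r = new Mat(m.rows, m.cols)
      for (i <- 0 until m.rows; j <- 0 until m.cols)
        r.data(i)(j) = m.data(i)(j) * o.data(i)(j)
      r
    }
    // Matrix multiplication, spelled %*% so it cannot collide with *.
    def %*%(o: Mat): Mat = {
      val r = new Mat(m.rows, o.cols)
      for (i <- 0 until m.rows; j <- 0 until o.cols; k <- 0 until m.cols)
        r.data(i)(j) += m.data(i)(k) * o.data(k)(j)
      r
    }
  }
}
```

A single `import Dsl._` then brings both the constructor and the operator syntax into scope, which is why this style keeps the Java matrix classes untouched.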
> >> >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> >> >> > >
> >> >> > >> I think that this contract has migrated a bit from the first
> >> >> > >> starting point.
> >> >> > >>
> >> >> > >> My feeling is that there is a de facto contract now that the
> >> >> > >> matrix slice is a single row.
> >> >> > >>
> >> >> > >> Sent from my iPhone
> >> >> > >>
> >> >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >> >> > >>
> >> >> > >> > What does Matrix.iterateAll() contractually do? In practice
> >> >> > >> > it seems to be row-wise iteration for some implementations,
> >> >> > >> > but it doesn't seem to contractually state so in the javadoc.
> >> >> > >> > What is a MatrixSlice if it is neither a row nor a column?
> >> >> > >> > How can I tell what exactly it is I am iterating over?
> >> >> > >> >
> >> >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> >> >> > >> >
> >> >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> >> >> > >> >>
> >> >> > >> >>>> Question #2: which in-core solvers are available for
> >> >> > >> >>>> Mahout matrices? I know there's SSVD, probably Cholesky;
> >> >> > >> >>>> is there something else? In particular, I need to be
> >> >> > >> >>>> solving linear systems; I guess Cholesky should be
> >> >> > >> >>>> equipped enough to do just that?
> >> >> > >> >>>>
> >> >> > >> >>>> Question #3: why did we try to import Colt solvers rather
> >> >> > >> >>>> than actually depend on Colt in the first place? Why did
> >> >> > >> >>>> we not accept Colt's sparse matrices and create native
> >> >> > >> >>>> ones instead?
> >> >> > >> >>>> Colt seems to have a notion of sparse in-core matrices
> >> >> > >> >>>> too, and seems like a well-rounded solution. However, it
> >> >> > >> >>>> doesn't seem to be actively supported, whereas I know
> >> >> > >> >>>> Mahout's in-core matrix support has seen continued
> >> >> > >> >>>> enhancements.
> >> >> > >> >>>
> >> >> > >> >>> Colt was totally abandoned, and I talked to the original
> >> >> > >> >>> author and he blessed its adoption. When we pulled it in,
> >> >> > >> >>> we found it was woefully undertested, and tried our best to
> >> >> > >> >>> hook it in with proper tests and use APIs that fit the use
> >> >> > >> >>> cases we had. Plus, we already had the start of some linear
> >> >> > >> >>> APIs (i.e. the Vector interface), and dropping that API
> >> >> > >> >>> completely seemed not terribly worth it at the time.
> >> >> > >> >>
> >> >> > >> >> There was even more to it than that.
> >> >> > >> >>
> >> >> > >> >> Colt was under-tested, and there have been warts that had to
> >> >> > >> >> be pulled out in much of the code.
> >> >> > >> >>
> >> >> > >> >> But, worse than that, Colt's matrix and vector structure was
> >> >> > >> >> a real bugger to extend or change. It also had all kinds of
> >> >> > >> >> cruft where it pretended to support matrices of things, but
> >> >> > >> >> in fact only supported matrices of doubles and floats.
> >> >> > >> >>
> >> >> > >> >> So using Colt as it was (and is, since it is largely
> >> >> > >> >> abandoned) was a non-starter.
> >> >> > >> >> As far as in-memory solvers, we have:
> >> >> > >> >>
> >> >> > >> >> 1) LR decomposition (tested and kinda fast)
> >> >> > >> >>
> >> >> > >> >> 2) Cholesky decomposition (tested)
> >> >> > >> >>
> >> >> > >> >> 3) SVD (tested)
> >> >
> >> > --
> >> > -jake
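[Editor's aside] On the question raised above of whether Cholesky is "equipped enough" to solve linear systems: it is, for symmetric positive-definite A. A minimal, self-contained textbook sketch (not Mahout's CholeskyDecomposition class): factor A = L·Lᵀ, then solve by forward and back substitution.

```scala
object CholeskySketch {
  // Factor a symmetric positive-definite matrix A into lower-triangular L
  // with A = L * L^T. Plain textbook algorithm, no pivoting.
  def cholesky(a: Array[Array[Double]]): Array[Array[Double]] = {
    val n = a.length
    val l = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 to i) {
      val s = (0 until j).map(k => l(i)(k) * l(j)(k)).sum
      l(i)(j) =
        if (i == j) math.sqrt(a(i)(i) - s)
        else (a(i)(j) - s) / l(j)(j)
    }
    l
  }

  // Solve A x = b: forward substitution for L y = b, then back
  // substitution for L^T x = y.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    val l = cholesky(a)
    val y = new Array[Double](n)
    for (i <- 0 until n)
      y(i) = (b(i) - (0 until i).map(k => l(i)(k) * y(k)).sum) / l(i)(i)
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1)
      x(i) = (y(i) - (i + 1 until n).map(k => l(k)(i) * x(k)).sum) / l(i)(i)
    x
  }
}
```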