Re: SparkBindings on a real cluster
@Seb: submitted PRs containing simple fixes for the ones having exclamation marks... assuming they are critical. If you could test/comment, that would be awesome.

On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter wrote:
> Hi,
>
> I did some experimentation with the spark bindings on a real cluster
> yesterday [...]
>
> Best,
> Sebastian
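FWIW, the per-vector type tag raised in this thread can be factored out of the format. Here is a minimal sketch in plain Java (hypothetical names, not the actual MatrixWritable code; sparse rows are modeled as index-to-value maps):

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: serialize a sparse row matrix writing the vector
// type tag once for the whole matrix instead of once per row.
public class MatrixSerSketch {
    static final byte SPARSE = 1;

    // rows: list of (index -> value) maps standing in for sparse row vectors
    static byte[] write(List<Map<Integer, Double>> rows, int numCols) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(SPARSE);          // type tag written ONCE per matrix
        out.writeInt(rows.size());
        out.writeInt(numCols);
        for (Map<Integer, Double> row : rows) {
            out.writeInt(row.size());   // only non-zeros are serialized
            for (Map.Entry<Integer, Double> e : row.entrySet()) {
                out.writeInt(e.getKey());
                out.writeDouble(e.getValue());
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        Map<Integer, Double> r0 = new HashMap<>();
        r0.put(3, 1.0);
        rows.add(r0);
        rows.add(new HashMap<>());  // an empty row costs 4 bytes, not numCols doubles
        byte[] bytes = write(rows, 1000);
        // 1 (tag) + 4 + 4 (dims) + [4 + 12] (row 0) + [4] (row 1) = 29 bytes
        System.out.println(bytes.length);
    }
}
```

With a per-matrix tag, the type byte doesn't repeat for each of potentially millions of rows, and sparse rows pay only for their stored entries.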
Re: SparkBindings on a real cluster
Thanks for identifying these issues, this is a great help. As expected, in-memory performance is (and will always be, as power-law algorithms will stay predominantly CPU-bound) a function of mahout-math (i.e. in-core math). So I think we can put to rest the "Spark performance crashed by something" discussions.

Re: in-memory efficiency. I am dubious about h2o involvement with their current set of techniques, for a few reasons. One, their compression techniques do not fit our in-core algebra use cases (or fit them only as corner cases, such as an indicator matrix but not even, say, a ternary matrix). I don't want to extend the argument base here, but I can explain it on the side if there is any pragmatic merit in pursuing it. Second, the compression issue itself is not the main issue here. A large portion of it is cost-based algorithm selection for matrix operations, which we simply don't have for stuff like open-hash-based sparse matrices. Third, purely from a project-management perspective, 0xdata has already been called upon to participate in these issues more than once and chose to stay away. They've been clear they are pursuing a particular business interest here. Naturally we can't expect them to fix existing Mahout stuff beyond that stated business interest, and Mahout's Colt-derived in-core math has clearly not been in that scope.

Re: matrix serialization issues: so it's even worse than I thought. This sounds serious.

Re: blockify: I thought there was (used to be, at least) an attempt to assess the sparsity of incoming data by looking at the first incoming vector's type, and to subsequently treat the block data accordingly. I can look into it.

Re: drmFromHDFS: this method partitions based on data affinity. There's no need to mess with this process additionally at this level.
There may be an issue of further re-splitting the data for the purposes of improved task scheduling (one of the issues identified in my talk), but that is not the business of the loader, and it is not limited to the case of initial loading. This needs a bit more thought rather than hacking the loader, IMO.

On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter wrote:
> Hi,
>
> * SparseRowMatrix with sequential vectors times SparseRowMatrix with
> sequential vectors is totally broken, it uses three nested loops and uses
> get(row, col) on the matrices, which internally uses binary search...

Back to my point about cost-based matrix operations... This is a huge issue: it has been big for vectors, and is probably even bigger for matrices. I wonder whether I'd rather test it with Breeze, to see if they are any better by now at handling that mixed-operand-type stuff, or whether they pretty much just restricted themselves to a solid LAPACK integration. That would completely de-mahout-ize the whole idea, though it may save a lot of grief dealing with mahout-math issues. I don't preclude that internally I may have to do just that.

> * At operator adds the column vectors it creates, this is unnecessary as
> we don't need the addition, we can just merge the vectors

They need to be added to handle cases of matrices with duplicate row keys. It is a bit weird, but there's code relying on this fact (Lloyd iterations of k-means, not in the public code base). Are you sure vector addition cannot be as efficient as a concat, assuming we expect it to be _mostly_ a concat? Not making much of an argument here, just pointing out an existing use case of mine which this technique would not cover.

> * we need a dedicated operator for inCoreA %*% drmB, currently this gets
> rewritten to (drmB.t %*% inCoreA.t).t which is highly inefficient (I have
> a prototype of that operator)

Well, this involves one shuffle task. The outer transposition may still be rewritten away (I guess in practice it wasn't for you). I did not have a chance to think about it much yet (since I haven't had the case of a left-multiply by an in-core operand), but I couldn't think of any algorithm that wouldn't involve at least one shuffle task. I'd be happy to look at your prototype, especially if it avoids shuffle tasks the way the right-multiply version does.

> Best,
> Sebastian
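As a sanity check on the rewrite itself: the identity behind it, A %*% B == (B.t %*% A.t).t, is easy to verify in-core. A plain-Java sketch (not Mahout code; the concern with the rewrite is the cost of the physical transpositions and the shuffle, not correctness):

```java
// Checks the identity behind the rewrite of inCoreA %*% drmB:
// multiplying directly and via the double-transposed form must agree.
public class TransposeIdentity {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b[0].length, k = b.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int p = 0; p < k; p++)       // loop order chosen so the
                for (int j = 0; j < m; j++)   // inner loop scans a row of b
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2, 0}, {0, 3, 4}};    // stands in for inCoreA
        double[][] b = {{1, 0}, {0, 1}, {5, 6}};  // stands in for drmB
        double[][] direct = multiply(a, b);
        double[][] rewritten = transpose(multiply(transpose(b), transpose(a)));
        System.out.println(java.util.Arrays.deepEquals(direct, rewritten)); // true
    }
}
```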
Re: SparkBindings on a real cluster
Great list of issues.

On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter wrote:
> Hi,
>
> I did some experimentation with the spark bindings on a real cluster
> yesterday [...]
>
> Best,
> Sebastian
SparkBindings on a real cluster
Hi,

I did some experimentation with the Spark bindings on a real cluster yesterday, as I had to run some experiments for a paper (unrelated to Mahout) that I'm currently writing. The experiment basically consists of multiplying a sparse data matrix by a super-sparse permutation-like matrix from the left. It took me the whole day to get it working, up to matrices with 500M entries.

I ran into lots of issues that we have to fix ASAP. Unfortunately I don't have much time in the next weeks, so I'm just sharing a list of the issues that I ran into (maybe I'll find some time to create issues for these things on the weekend).

I think the major challenge for us will be to get the choice of dense/sparse representations correct and to put lots of work into memory efficiency. This could be a great hook for collaborating with the h2o folks, as they know how to make vector-like data small and computations fast.

Here's the list:

* Our matrix serialization in MatrixWritable is seriously flawed; I ran into the following errors:

  - the type information is stored with every vector, although a matrix always contains only vectors of the same type
  - all entries of a TransposeView (and possibly other views) of a sparse matrix are serialized, resulting in OOM
  - for sparse row matrices, the vectors are set using assign instead of via constructor injection; this results in huge memory consumption and long creation times, as some implementations use binary search for assignment

* A dense matrix is converted into a SparseRowMatrix with dense row vectors by blockify(); after serialization this becomes a dense matrix in sparse format (triggering OOMs)!

* drmFromHDFS does not have an option to set the number of desired partitions.

* SparseRowMatrix with sequential vectors times SparseRowMatrix with sequential vectors is totally broken: it uses three nested loops and calls get(row, col) on the matrices, which internally uses binary search...

* The At operator adds the column vectors it creates; this is unnecessary, as we don't need the addition, we can just merge the vectors.

* We need a dedicated operator for inCoreA %*% drmB; currently this gets rewritten to (drmB.t %*% inCoreA.t).t, which is highly inefficient (I have a prototype of that operator).

Best,
Sebastian
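P.S. To make the sparse-times-sparse point concrete, the fix is to iterate only the stored non-zeros of the left operand instead of probing every (row, col) cell. A rough plain-Java sketch of that shape (hypothetical model types, not the actual SparseRowMatrix code; rows are modeled as sorted index-to-value maps standing in for sequential-access sparse vectors):

```java
import java.util.*;

// Sketch: row-major sparse * sparse that touches only stored non-zeros,
// instead of three nested loops calling get(row, col) -- which on
// sequential-access vectors costs a binary search per probe.
public class SparseTimesSparse {
    static List<TreeMap<Integer, Double>> multiply(
            List<TreeMap<Integer, Double>> a, List<TreeMap<Integer, Double>> b) {
        List<TreeMap<Integer, Double>> c = new ArrayList<>();
        for (TreeMap<Integer, Double> aRow : a) {
            TreeMap<Integer, Double> cRow = new TreeMap<>();
            for (Map.Entry<Integer, Double> e : aRow.entrySet()) {
                // c[i] += a[i][k] * b[k]  -- visits only non-zeros of row k of b
                for (Map.Entry<Integer, Double> f : b.get(e.getKey()).entrySet()) {
                    cRow.merge(f.getKey(), e.getValue() * f.getValue(), Double::sum);
                }
            }
            c.add(cRow);
        }
        return c;
    }

    public static void main(String[] args) {
        // super-sparse permutation-like left matrix, as in the experiment above
        List<TreeMap<Integer, Double>> p = List.of(
            new TreeMap<>(Map.of(1, 1.0)), new TreeMap<>(Map.of(0, 1.0)));
        List<TreeMap<Integer, Double>> x = List.of(
            new TreeMap<>(Map.of(0, 5.0)), new TreeMap<>(Map.of(1, 7.0)));
        System.out.println(multiply(p, x)); // rows of x, swapped
    }
}
```

The total work is proportional to the number of non-zero products, so a permutation-like left operand costs roughly one row copy per output row rather than rows * cols binary searches.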