Re: SparkBindings on a real cluster

2014-06-04 Thread Dmitriy Lyubimov
@Seb: I submitted PRs containing simple fixes for the ones marked with
exclamation points... assuming those are the critical ones. If you could
test/comment, that would be awesome.


Re: SparkBindings on a real cluster

2014-06-04 Thread Dmitriy Lyubimov
Thanks for identifying these issues. This is a great help.

As expected, in-memory performance is (and will always be, since power-law
algorithms will stay predominantly CPU-bound) a function of mahout-math, i.e.
the in-core math. So I think we can put to rest the "Spark performance crashed
by something" discussions.

Re: in-memory efficiency. I am dubious about h2o involvement, given their
current set of techniques, for a few reasons.

One, their compression techniques do not fit our in-core algebra use cases (or
fit them only as corner cases, e.g. an indicator matrix but not even, say, a
ternary matrix). I don't want to extend the argument here, but I can explain it
on the side if there is pragmatic merit in pursuing it.

Second, the compression issue itself is not the main issue here. A large part
of it is cost-based algorithm selection for matrix operations, which we simply
don't have for things like open-hash-based sparse matrices.

Third, purely from a project-management perspective, 0xdata has already been
called upon more than once to participate in these issues and chose to stay
away. They've been clear that they are here to pursue a particular business
interest, and naturally we can't expect them to fix existing Mahout code beyond
that stated interest. Mahout's Colt-derived in-core math clearly has not been
within that scope.

Re: Matrix serialization issues: so it's even worse than I thought, then. This
sounds serious.

Re: blockify: I thought there is (or used to be, at least) an attempt to assess
the sparsity of the incoming data by looking at the first incoming vector's
type, and to treat the block data accordingly. I can look into it.
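
Something along these lines is what I'd expect to find (or add) there; a
sketch only, against mahout-math's public API, with the method shape made up:

    import org.apache.mahout.math.{DenseMatrix, Matrix, SparseRowMatrix, Vector}

    // Pick the block type from the density of the first incoming row, so dense
    // data lands in a DenseMatrix instead of a SparseRowMatrix of dense vectors
    // (the case Sebastian hit).
    def blockify(rows: Array[Vector], ncol: Int): Matrix = {
      val block: Matrix =
        if (rows.head.isDense) new DenseMatrix(rows.length, ncol)
        else new SparseRowMatrix(rows.length, ncol)
      for ((v, i) <- rows.zipWithIndex) block.assignRow(i, v)
      block
    }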

Re: drmFromHDFS: this method partitions based on data affinity. There's no need
to mess with that process at this level. There may be an issue of further
re-splitting the data to improve task scheduling (one of the issues identified
in my talk), but that is not the business of the loader, and it is not limited
to the case of initial loading. This needs a bit more thought rather than
hacking the loader, IMO.
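
For illustration, re-splitting after the fact is already expressible at the
plain Spark level without touching the loader (a sketch; numTasks is whatever
the scheduling analysis would suggest, and shuffling Mahout vectors assumes the
bindings' Kryo serializers are registered):

    import org.apache.hadoop.io.IntWritable
    import org.apache.mahout.math.VectorWritable
    import org.apache.spark.SparkContext

    def loadAndResplit(sc: SparkContext, path: String, numTasks: Int) = {
      // the loader keeps HDFS block affinity ...
      val rows = sc.sequenceFile(path, classOf[IntWritable], classOf[VectorWritable])
        .map { case (k, v) => k.get() -> v.get() } // unwrap Writables before any shuffle
      // ... and re-splitting for scheduling is a separate, later concern
      rows.repartition(numTasks) // incurs a shuffle; coalesce(numTasks) shrinks without one
    }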

On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter wrote:

> * SparseRowMatrix with sequential vectors times SparseRowMatrix with
> sequential vectors is totally broken: it uses three nested loops and
> get(row, col) on the matrices, which internally uses binary search...
>

Back to my point about cost-based matrix operations... this is a huge issue; it
has been big for vectors and is probably even bigger for matrices. I wonder if
I'd rather test this with Breeze, to see whether they are any better by now at
handling mixed operand types, or whether they have pretty much restricted
themselves to a solid LAPACK integration. That would completely de-mahout-ize
the whole idea, but it may save a lot of grief dealing with mahout-math issues.
I don't rule out that internally I may have to do just that.
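
To make the cost-based point concrete: for the sequential-sparse case quoted
above, a structure-aware kernel walks the non-zeros instead of probing
get(row, col). A rough sketch against mahout-math, not production code:

    import org.apache.mahout.math.{Matrix, SparseRowMatrix}
    import org.apache.mahout.math.function.Functions
    import scala.collection.JavaConverters._

    // C = A %*% B touching only the non-zeros of A:
    // for each a(i, k) != 0, c(i, :) += a(i, k) * b(k, :).
    def sparseTimes(a: Matrix, b: Matrix): Matrix = {
      val c = new SparseRowMatrix(a.rowSize(), b.columnSize())
      for (i <- 0 until a.rowSize(); el <- a.viewRow(i).nonZeroes().asScala)
        c.viewRow(i).assign(b.viewRow(el.index()).times(el.get()), Functions.PLUS)
      c
    }

A real cost-based layer would pick this kernel (or a dense one, or a hashed
one) from the operand structure instead of hardcoding a single loop.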


>
> * the At operator adds the column vectors it creates; this is
> unnecessary, as we don't need the addition, we can just merge the vectors
>

They need to be added to handle matrices with duplicate row keys. It is a bit
weird, but there is code that relies on this fact (Lloyd iterations of k-means,
not in the public code base). Are you sure vector addition cannot be as
efficient as a concat, assuming we expect it to be _mostly_ a concat? I'm not
making much of an argument here, just pointing out an existing use case of mine
that this technique would not cover.
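
A toy version of the overlap I mean, with mahout-math vectors (values made up):
two partial column vectors for the same row key can carry values at the same
index, and a merge that keeps only one of them would be wrong:

    import org.apache.mahout.math.RandomAccessSparseVector
    import org.apache.mahout.math.function.Functions

    val partial1 = new RandomAccessSparseVector(5)
    val partial2 = new RandomAccessSparseVector(5)
    partial1.setQuick(2, 1.0) // contribution from the first copy of a row key
    partial2.setQuick(2, 3.0) // contribution from the duplicate of the same key
    partial1.assign(partial2, Functions.PLUS) // must yield 4.0 at index 2
    assert(partial1.getQuick(2) == 4.0) // a pure merge would keep 1.0 or 3.0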


>
> * we need a dedicated operator for inCoreA %*% drmB; currently this gets
> rewritten to (drmB.t %*% inCoreA.t).t, which is highly inefficient (I have
> a prototype of that operator)
>

Well, this involves one shuffle task. The outer transposition may still be
rewritten away (I guess in practice it wasn't for you).

I have not had a chance to think about it much yet (since I haven't had the
case of a left-multiply by an in-core operand), but I couldn't think of any
algorithm that wouldn't involve at least one shuffle task. I'd be happy to look
at your prototype, especially if it avoids shuffle tasks the way the
right-multiply version does.
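
For reference, the shape I'd expect such an operator to take, sketched at the
raw Spark level (drmRddB stands for B's row RDD; this is not your prototype,
and it assumes matrix (de)serialization behaves, which is part of what this
thread is about): each task folds its rows of B into a partial product, and the
single global aggregation of the partials is where the shuffle would show up:

    import org.apache.mahout.math.{DenseMatrix, Matrix, Vector}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.collection.JavaConverters._

    def inCoreTimesDrm(sc: SparkContext, inCoreA: Matrix,
                       drmRddB: RDD[(Int, Vector)], nColB: Int): Matrix = {
      val bcastA = sc.broadcast(inCoreA)
      drmRddB.mapPartitions { it =>
        val a = bcastA.value
        // dense accumulator: no per-element binary search while summing
        val partial: Matrix = new DenseMatrix(a.rowSize(), nColB)
        // C(i, j) += A(i, k) * B(k, j) for every non-zero B(k, j) seen here
        for ((k, bRow) <- it; el <- bRow.nonZeroes().asScala; i <- 0 until a.rowSize())
          partial.setQuick(i, el.index(),
            partial.getQuick(i, el.index()) + a.getQuick(i, k) * el.get())
        Iterator(partial)
      }.reduce(_.plus(_)) // sum the per-partition partial products
    }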


> Best,
> Sebastian


Re: SparkBindings on a real cluster

2014-06-04 Thread Ted Dunning
Great list of issues.


SparkBindings on a real cluster

2014-06-04 Thread Sebastian Schelter

Hi,

I did some experimentation with the Spark bindings on a real cluster 
yesterday, as I had to run some experiments for a paper (unrelated to 
Mahout) that I'm currently writing. The experiment basically consists of 
multiplying a sparse data matrix from the left by a super-sparse, 
permutation-like matrix. It took me the whole day to get it working, up 
to matrices with 500M entries.
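
In DSL terms the computation is just this (a toy-sized sketch; the path,
dimensions and non-zero pattern are made up, and the usual spark-bindings
DSL imports for drmFromHDFS and %*% are assumed):

    import org.apache.mahout.math.SparseRowMatrix

    val drmA = drmFromHDFS("/data/A")   // the sparse data matrix as a DRM
    val m = 100; val n = 1000
    val p = new SparseRowMatrix(m, n)   // super-sparse, permutation-like
    for (i <- 0 until m)
      p.setQuick(i, i * (n / m), 1.0)   // one non-zero per row
    val drmC = p %*% drmA               // multiply from the left; currently
                                        // rewritten to (drmA.t %*% p.t).t,
                                        // see the last item in the list below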


I ran into lots of issues that we have to fix ASAP. Unfortunately, I 
don't have much time in the next weeks, so I'm just sharing a list of 
the issues that I ran into (maybe I'll find some time to create issues 
for these things on the weekend).


I think the major challenge for us will be to get the choice of dense/sparse 
right and to put lots of work into memory efficiency. This could be a 
great hook for collaborating with the h2o folks, as they know how to 
make vector-like data small and computations fast.


Here's the list:

* our matrix serialization in MatrixWritable is seriously flawed; I ran 
into the following problems (a sketch of the constructor-injection fix 
follows this list):

  - the type information is stored with every vector, although a matrix 
only ever contains vectors of the same type
  - all entries of a TransposeView (and possibly other views) of a 
sparse matrix are serialized, resulting in OOM
  - for sparse row matrices, the vectors are set using assign instead 
of via constructor injection; this results in huge memory consumption 
and long creation times, as some implementations use binary search for 
assignment


* a dense matrix is converted into a SparseRowMatrix with dense row 
vectors by blockify(); after serialization this becomes a dense matrix 
in sparse format (triggering OOMs)!


* drmFromHDFS does not have an option to set the number of desired 
partitions


* SparseRowMatrix with sequential vectors times SparseRowMatrix with 
sequential vectors is totally broken: it uses three nested loops and 
get(row, col) on the matrices, which internally uses binary search...


* the At operator adds the column vectors it creates; this is 
unnecessary, as we don't need the addition, we can just merge the vectors


* we need a dedicated operator for inCoreA %*% drmB; currently this gets 
rewritten to (drmB.t %*% inCoreA.t).t, which is highly inefficient (I 
have a prototype of that operator)
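
As promised above, a sketch of the constructor-injection read side 
(illustrative only; it assumes VectorWritable.readVector and the 
SparseRowMatrix constructor that takes a vector array):

    import java.io.DataInput
    import org.apache.mahout.math.{Matrix, SparseRowMatrix, Vector, VectorWritable}

    def readMatrix(in: DataInput): Matrix = {
      val rows = in.readInt()
      val cols = in.readInt()
      val vectors = new Array[Vector](rows)
      for (i <- 0 until rows) vectors(i) = VectorWritable.readVector(in)
      // constructor injection: rows are handed over directly, no assign(),
      // no per-element binary search
      new SparseRowMatrix(rows, cols, vectors)
    }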


Best,
Sebastian