Github user fommil commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35216981
@martinjaggi I believe you would be making a massive mistake by agreeing on
a single serialisation scheme for sparse vectors, unless that format is
independent of the storage structure (e.g. MatrixMarket), or unless you're
happy to pass the implementation details of the sparse format along with the
structure. BTW, the generic problem of building a sparse matrix from a data
stream is far from solved... some structures are intrinsically slow to
construct, and others require up-front hints to avoid waste during
construction.
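For illustration, here's a minimal sketch of what a storage-independent wire
format could look like, in the spirit of MatrixMarket: a length plus a
coordinate stream of (index, value) pairs. The names (`SparseVectorLike`,
`CoordinateCodec`) are hypothetical, not from this PR; the point is that any
backing structure can emit and rebuild from such a stream, so the format
leaks no storage details.

```scala
// Hypothetical sketch: a storage-independent coordinate format for
// sparse vectors. Any backing structure only needs to expose its
// non-zeros as an iterator; nothing about CSC/CSR/linked storage
// leaks into the serialised form.
trait SparseVectorLike {
  def length: Int
  def activeEntries: Iterator[(Int, Double)] // non-zeros, any order
}

object CoordinateCodec {
  def serialise(v: SparseVectorLike): String =
    (Iterator.single(v.length.toString) ++
      v.activeEntries.map { case (i, x) => s"$i $x" }).mkString("\n")

  // The receiver is free to rebuild into whatever storage structure
  // suits its access pattern (CSC, hash map, linked lists, ...).
  def deserialise(s: String): (Int, Seq[(Int, Double)]) = {
    val lines = s.split('\n')
    val entries = lines.tail.toSeq.map { line =>
      val Array(i, x) = line.split(' ')
      (i.toInt, x.toDouble)
    }
    (lines.head.toInt, entries)
  }
}
```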
Most of the battle when doing sparse matrix calculations is choosing, or
writing, an appropriate data storage format that best suits the problem.
Having a toolkit of generic sparse storage formats is good, but
orders-of-magnitude improvements can be obtained with a custom-made format.
What works for you might be really bad for somebody else.
There is no silver-bullet sparse matrix format.
MTJ has compressed row, compressed column and compressed diagonal formats
(with some custom sparse structures for special matrices) and a special doubly
linked structure for unstructured sparse matrices.
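To make the trade-offs concrete, here's the textbook layout of compressed
sparse row (a simplification, not MTJ's actual code), for the 3x3 matrix
[[1,0,2],[0,0,3],[4,5,0]]:

```scala
// Textbook CSR sketch: three flat primitive arrays, where row i
// occupies the half-open slice rowPtr(i) until rowPtr(i + 1).
val values = Array(1.0, 2.0, 3.0, 4.0, 5.0) // non-zeros, row by row
val colIdx = Array(0,   2,   2,   0,   1)   // column of each value
val rowPtr = Array(0, 2, 3, 5)              // row boundaries into values

// Row traversal is a tight loop over contiguous memory; column
// traversal would have to scan every row (the CSR/CSC asymmetry).
def rowDot(i: Int, x: Array[Double]): Double = {
  var sum = 0.0
  var k = rowPtr(i)
  while (k < rowPtr(i + 1)) { sum += values(k) * x(colIdx(k)); k += 1 }
  sum
}
```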
Insertion is usually slow for structures that are optimised for
calculations. I recently created a compressed column structure backed by a
hand-crafted primitive linked list... optimised for insertion speed, but
potentially duplicating entries, and definitely not optimal for row traversal
(although column traversal is reasonable). If insertion is to be performed
concurrently, that requires yet more thought.
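A much-simplified sketch of that insertion-optimised idea (illustrative
only, not the actual structure): per-column linked lists built from parallel
primitive arrays, where insert is an O(1) prepend at the cost of possible
duplicates and slow row traversal. Note it is not remotely thread-safe,
which is the concurrency point above.

```scala
// Illustrative sketch, not the real implementation: per-column linked
// lists over parallel primitive arrays (no boxing). Growth and bounds
// checks are omitted for brevity.
final class LinkedColumnBuffer(numCols: Int, capacity: Int) {
  private val row   = new Array[Int](capacity)
  private val value = new Array[Double](capacity)
  private val next  = new Array[Int](capacity)  // -1 terminates a list
  private val head  = Array.fill(numCols)(-1)   // head cell per column
  private var used  = 0

  // O(1) prepend; duplicate (i, j) entries are allowed here and would
  // have to be summed when compressing to a calculation-friendly form.
  def insert(i: Int, j: Int, x: Double): Unit = {
    row(used) = i; value(used) = x; next(used) = head(j)
    head(j) = used
    used += 1
  }

  // Column traversal is cheap; a row traversal would touch every column.
  def foreachInColumn(j: Int)(f: (Int, Double) => Unit): Unit = {
    var c = head(j)
    while (c != -1) { f(row(c), value(c)); c = next(c) }
  }
}
```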
I don't know what you mean by "sequential access", because that depends on
the ordering you choose: by row, column, diagonal, arrowhead? To optimise
calculations, you want the JIT to be able to treat everything as close as
possible to aligned arrays, to help the CPU caches (memory access is by far
the biggest bottleneck in linear algebra). Exotic data structures can look
great on paper, but the JIT just gets confused and all of a sudden you're
paying full RAM-access latency on every element.
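A toy illustration of that point: the same reduction over non-zeros, once
over a flat primitive array (the JIT compiles this to a tight loop over
contiguous memory) and once over a boxed map (hashing, pointer chasing and
unboxing on every element). Contrived, but it shows the two shapes.

```scala
// Both functions sum the non-zero values, but only the first gives the
// JIT and the CPU caches anything to work with.
def sumFlat(values: Array[Double]): Double = {
  var s = 0.0
  var k = 0
  while (k < values.length) { s += values(k); k += 1 }
  s
}

// A (row, col) -> value map looks elegant, but every element access
// boxes the key, hashes it, and chases pointers through the heap.
def sumBoxed(entries: Map[(Int, Int), Double]): Double =
  entries.values.sum
```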
Btw, 1% sparsity is actually quite dense when you're talking about really
large matrices. Whatever benchmarks you come up with will probably be highly
biased towards the kinds of problems you're solving. Creating a
representative suite of benchmarks would be quite the undertaking!
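To put that 1% in perspective: a 10^6 x 10^6 matrix at 1% density holds
10^10 non-zeros, which at 12 bytes per entry (an 8-byte double plus a 4-byte
row index, as in CSC) is roughly 120 GB before you've computed anything.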
Basically, I'm saying that Spark should remain flexible for future sparse
formats and shouldn't pick one at this early stage just because it works
well for logistic regression.