Github user martinjaggi commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35212055
Really looking forward to having sparse vectors in MLlib soon, this is
super important! And thanks for your efforts so far!
Just a quick comment about the benchmarks and requirements:
The biggest impact of sparse vectors will likely be in the
classification/regression methods, where the theoretical speedup is linear in
the sparsity of the vectors.
This is because the (sparse) vectors are all that is communicated in each
round (e.g. in SGD); it's not just that the original data is sparse (as in the
current k-means benchmark). To send such vectors over Spark, super **fast
serialization** is essential. It shouldn't be too hard to implement since, as
@mengxr already mentioned, all we need here is sequential-access sparse vectors
(backed by two parallel arrays). But I can see that it's quite an architecture
question.
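To make the "two parallel arrays" idea concrete, here is a minimal, hypothetical sketch (names like `SparseVector` and `dot` are illustrative, not MLlib's actual API): the non-zeros are stored as an index array plus a value array, and a dot product only walks the non-zeros sequentially.

```python
class SparseVector:
    """Illustrative sequential-access sparse vector backed by two parallel arrays."""

    def __init__(self, size, indices, values):
        assert len(indices) == len(values)
        self.size = size        # logical length of the vector
        self.indices = indices  # sorted positions of the non-zero entries
        self.values = values    # the non-zero entries themselves

    def dot(self, dense):
        # One sequential pass over the non-zeros only:
        # cost is O(nnz) instead of O(size).
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))

sv = SparseVector(6, [0, 3, 5], [1.0, 2.0, 3.0])
dense = [1.0, 1.0, 1.0, 2.0, 1.0, 1.0]
print(sv.dot(dense))  # 1*1 + 2*2 + 3*1 = 8.0
```

Because both arrays are plain primitive arrays, serializing such a vector is essentially just shipping two buffers, which is where the fast-serialization win comes from.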
When comparing different implementations, it would therefore be
useful to see how they impact SGD, for example in logistic regression on
some realistic data with ~1% sparsity.
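The speedup argument above can be sketched as follows (an illustrative, assumed implementation, not the PR's code): in a logistic-regression SGD step on one example, the gradient of the log loss is `(sigmoid(w·x) - y) * x`, which is non-zero only where `x` is non-zero, so each update touches only `nnz(x)` coordinates.

```python
import math

def sgd_step(w, indices, values, y, lr=0.1):
    """One logistic-regression SGD step on a sparse example (y in {0, 1}).

    `indices`/`values` are the parallel arrays of the sparse feature
    vector; per-step cost is O(nnz), not O(len(w)).
    """
    # Margin w.x over the non-zeros only.
    margin = sum(v * w[i] for i, v in zip(indices, values))
    p = 1.0 / (1.0 + math.exp(-margin))
    g = p - y  # scalar part of the gradient of the log loss
    # Update only the coordinates where x is non-zero.
    for i, v in zip(indices, values):
        w[i] -= lr * g * v
    return w

w = [0.0] * 10
sgd_step(w, [1, 4], [1.0, 2.0], y=1)
print(w[1], w[4])  # only these two coordinates moved
```

With ~1% sparsity this per-example cost is roughly 100x smaller than a dense update, which is the linear-in-sparsity speedup mentioned above.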
Sanjay Krishnan had some good results with using `BidMat` as an
implementation for exactly this, maybe we could ask him.