Github user martinjaggi commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35212055
Really looking forward to having sparse vectors in MLlib soon, this is
super important! And thanks for your efforts so far!
Just a quick comment about the benchmarks and requirements:
The biggest impact of sparse vectors will likely be in the
classification/regression methods, where the theoretical speedup is linear in
the sparsity of the vectors.
This is because the (sparse) vectors are all that is communicated in each
round (e.g. in SGD); it's not just that the original data is sparse (as in the
current k-means benchmark). To send such vectors over Spark, super **fast
serialization** is essential. It shouldn't be too hard to implement since, as
@mengxr already mentioned, all we need here is sequential-access sparse vectors
(backed by two parallel arrays). But I can see that it's quite an architecture
question.
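To make the "two parallel arrays" idea concrete, here is a minimal, hypothetical sketch (names like `SparseVector` and `dot` are illustrative, not MLlib's actual API): the non-zeros are stored as an index array plus a value array, and a dot product only walks the non-zeros sequentially.

```python
class SparseVector:
    """Illustrative sequential-access sparse vector backed by two parallel arrays."""

    def __init__(self, size, indices, values):
        assert len(indices) == len(values)
        self.size = size        # logical length of the vector
        self.indices = indices  # sorted positions of the non-zero entries
        self.values = values    # the non-zero entries themselves

    def dot(self, dense):
        # One sequential pass over the non-zeros only:
        # cost is O(nnz) instead of O(size).
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))

sv = SparseVector(6, [0, 3, 5], [1.0, 2.0, 3.0])
dense = [1.0, 1.0, 1.0, 2.0, 1.0, 1.0]
print(sv.dot(dense))  # 1*1 + 2*2 + 3*1 = 8.0
```

Because both arrays are plain primitive arrays, serializing such a vector is essentially just shipping two buffers, which is where the fast-serialization win comes from.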
When comparing different implementations, it would therefore be
useful to see how they impact SGD, for example in logistic regression on
some realistic data with ~1% sparsity.
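The speedup argument above can be sketched as follows (an illustrative, assumed implementation, not the PR's code): in a logistic-regression SGD step on one example, the gradient of the log loss is `(sigmoid(w·x) - y) * x`, which is non-zero only where `x` is non-zero, so each update touches only `nnz(x)` coordinates.

```python
import math

def sgd_step(w, indices, values, y, lr=0.1):
    """One logistic-regression SGD step on a sparse example (y in {0, 1}).

    `indices`/`values` are the parallel arrays of the sparse feature
    vector; per-step cost is O(nnz), not O(len(w)).
    """
    # Margin w.x over the non-zeros only.
    margin = sum(v * w[i] for i, v in zip(indices, values))
    p = 1.0 / (1.0 + math.exp(-margin))
    g = p - y  # scalar part of the gradient of the log loss
    # Update only the coordinates where x is non-zero.
    for i, v in zip(indices, values):
        w[i] -= lr * g * v
    return w

w = [0.0] * 10
sgd_step(w, [1, 4], [1.0, 2.0], y=1)
print(w[1], w[4])  # only these two coordinates moved
```

With ~1% sparsity this per-example cost is roughly 100x smaller than a dense update, which is the linear-in-sparsity speedup mentioned above.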
Sanjay Krishnan had some good results with using `BidMat` as an
implementation for exactly this, maybe we could ask him.