Github user martinjaggi commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35448685
Thanks @mengxr for the benchmark efforts! Just not sure if you got my
comment about part 2) in the benchmark, k-means: In my opinion this algorithm
is not very suitable for judging the sparse vector overhead, since it's the only
method in MLlib currently that does *not* communicate the vectors (only the
dense centers). In contrast, all gradient-based methods need to communicate the
sparse vectors in each iteration (of a MR). For these, serialization can often
take about as long as the vector x vector product, which is all the computation
there is. So I'm just saying that both matter in practice, but currently we
only benchmark one of the two, right?
Maybe this has something to do with what @etrain ran into with early sparse
tests? Or do you guys think this is not an issue? I would be curious to see
how the candidates perform on some of the gradient-based methods, and at which
sparsity/load factor the sparse vectors start beating the dense vectors.
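
To make the break-even question concrete, here is a rough sketch in plain Python (not MLlib code, and not any of the candidate libraries). It assumes a hypothetical wire format of 8-byte doubles for dense vectors and a 4-byte int index plus an 8-byte double per nonzero for sparse vectors; the helper names are made up for illustration. Under those assumptions the sparse form is smaller on the wire only when fewer than 2/3 of the entries are nonzero. Real serializers add their own overhead, so the actual crossover would have to be measured.

```python
import struct

def serialize_dense(values):
    # Assumed format: one little-endian 8-byte double per dimension.
    return struct.pack(f"<{len(values)}d", *values)

def serialize_sparse(indices, values):
    # Assumed format: a 4-byte int index and an 8-byte double per nonzero.
    return (struct.pack(f"<{len(indices)}i", *indices)
            + struct.pack(f"<{len(values)}d", *values))

def to_sparse(dense):
    # Keep only the nonzero entries as parallel (indices, values) lists.
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    return indices, [dense[i] for i in indices]

def sparse_dot(indices, values, weights):
    # Sparse vector x dense weight vector: touches only the nonzeros.
    return sum(v * weights[i] for i, v in zip(indices, values))

def sparse_is_smaller(n, nnz):
    # 12 bytes per nonzero vs 8 bytes per dimension: break-even at nnz/n = 2/3.
    return 12 * nnz < 8 * n
```

For a 10-dimensional vector with a single nonzero, this gives 80 bytes dense versus 12 bytes sparse, and the dot product touches one entry instead of ten. A benchmark on the gradient methods would need to time both the serialization step and the product across a range of densities.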