Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-19 Thread Yang
In my case, the data set is fairly small (100k training samples), though roughly 100k features are populated out of 10 million possible features. Distributing the training process does not help me here, since the data size is so small. I just need a good core solver to train the
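
To make the single-machine case concrete, here is a minimal sketch of sparse logistic regression trained with plain SGD. It is not MLlib code and not necessarily the solver the poster had in mind. The function names (train_sparse_logreg, predict), the dict-based feature encoding, the hyperparameters, and the toy data are all assumptions made for illustration. The point is only that memory scales with the features that actually occur, not with the 10 million nominal dimensions.

```python
import math
from collections import defaultdict


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    # features maps feature index -> value; absent weights count as 0.
    margin = sum(weights.get(i, 0.0) * v for i, v in features.items())
    return sigmoid(margin)


def train_sparse_logreg(samples, epochs=20, lr=0.5, l2=0.0):
    # samples: list of (features, label) with label in {0, 1}.
    # Hypothetical helper: weights live in a dict, so only features that
    # appear in the data ever get an entry.
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in samples:
            error = predict(weights, features) - label
            for i, v in features.items():
                # Gradient of the log loss, plus an optional L2 term.
                weights[i] -= lr * (error * v + l2 * weights[i])
    return weights


# Toy data: indices range up to 10 million, but only five are ever used.
toy_samples = [
    ({3: 1.0, 9_999_999: 1.0}, 1),
    ({3: 1.0, 42: 1.0}, 0),
    ({9_999_999: 1.0, 77: 1.0}, 1),
    ({42: 1.0, 500_000: 1.0}, 0),
]
toy_model = train_sparse_logreg(toy_samples)
```

A production solver would more likely use L-BFGS or OWL-QN over a compressed sparse matrix; SGD over dicts is used here only because it keeps the sketch short.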

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-11 Thread Nick Pentreath
That's a good point about shuffle data compression. Still, I think it would be good to benchmark the ideas behind https://github.com/apache/spark/pull/12761. For many datasets, even within one partition, the gradient sums etc. can remain very sparse. For example, Criteo DAC data is extremely sparse
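
The sparsity point can be illustrated with a small sketch. This is not the code in the linked pull request and not Spark's aggregation path. The helper names (sample_gradient, partition_gradient_sum, merge_gradient_sums) and the toy partitions are hypothetical. The sketch only shows that a per-partition gradient sum kept as an index-to-value map holds one entry per feature seen in that partition, instead of a dense array of 10 million doubles.

```python
import math


def sample_gradient(features, label, weights):
    # Log-loss gradient for one sample; it is nonzero only on the
    # sample's active features.
    margin = sum(weights.get(i, 0.0) * v for i, v in features.items())
    if margin >= 0:
        prob = 1.0 / (1.0 + math.exp(-margin))
    else:
        e = math.exp(margin)
        prob = e / (1.0 + e)
    error = prob - label
    return {i: error * v for i, v in features.items()}


def partition_gradient_sum(partition, weights):
    # Sum per-sample gradients within one partition, staying sparse.
    total = {}
    for features, label in partition:
        for i, g in sample_gradient(features, label, weights).items():
            total[i] = total.get(i, 0.0) + g
    return total


def merge_gradient_sums(a, b):
    # Combine two partition-level sums (the "reduce" step).
    merged = dict(a)
    for i, g in b.items():
        merged[i] = merged.get(i, 0.0) + g
    return merged


# Two toy partitions over a nominal 10-million-dimensional space.
partition_a = [({1: 1.0, 5: 2.0}, 1), ({5: 1.0}, 0)]
partition_b = [({9_999_999: 1.0}, 1)]
zero_weights = {}

sum_a = partition_gradient_sum(partition_a, zero_weights)
sum_b = partition_gradient_sum(partition_b, zero_weights)
global_sum = merge_gradient_sums(sum_a, sum_b)
```

Whether a sparse representation beats a dense array plus shuffle compression in practice depends on how quickly the partial sums fill in, which is the kind of thing a benchmark would settle.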