In my case, my training set is fairly small (100k samples), though the feature count is roughly 100k populated out of 10M possible features. In this situation it does not help me to distribute the training process, since the data size is so small. I just need a good core solver to train the model.
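Just to illustrate the regime I mean, here is a minimal sketch using Spark ML's sparse vectors (the indices and values are made up for illustration): per-sample storage scales with the number of non-zeros, not with the 10M-wide feature space, so 100k such samples fit comfortably on a single machine.

```scala
import org.apache.spark.ml.linalg.Vectors

val numFeatures = 10000000                            // 10M possible features
val activeIndices = Array(3, 1024, 999999, 5000000)   // only the populated slots
val activeValues  = Array(1.0, 0.5, 2.0, 1.0)

// One sample: a 10M-dimensional vector stored as (index, value) pairs only.
val sample = Vectors.sparse(numFeatures, activeIndices, activeValues)
// sample.numNonzeros == 4; memory cost is O(nnz), independent of numFeatures
```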
That's a good point about shuffle data compression. Still, I think it would be
good to benchmark the ideas behind https://github.com/apache/spark/pull/12761.
For many datasets, even within one partition the gradient sums etc. can
remain very sparse. For example, the Criteo DAC data is extremely sparse.
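A rough sketch of that idea, assuming a simple squared-loss pass over one partition (the `Sample` case class and `partitionGradient` helper are hypothetical, not Spark APIs): the per-partition gradient accumulator only ever holds the feature indices actually touched in that partition, so it stays sparse when the data is.

```scala
import scala.collection.mutable

case class Sample(indices: Array[Int], values: Array[Double], label: Double)

def partitionGradient(partition: Iterator[Sample],
                      weights: Map[Int, Double]): mutable.HashMap[Int, Double] = {
  val grad = mutable.HashMap.empty[Int, Double]   // only touched indices appear
  partition.foreach { s =>
    // sparse dot product against the current weights
    var margin = 0.0
    var i = 0
    while (i < s.indices.length) {
      margin += weights.getOrElse(s.indices(i), 0.0) * s.values(i)
      i += 1
    }
    val err = margin - s.label                    // squared-loss residual
    i = 0
    while (i < s.indices.length) {
      val idx = s.indices(i)
      grad(idx) = grad.getOrElse(idx, 0.0) + err * s.values(i)
      i += 1
    }
  }
  grad   // size bounded by the distinct feature indices seen in this partition
}
```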