Any suggestion for a sparser dataset? I will test more tomorrow in the office.
On Apr 23, 2014 9:33 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:

> Sorry - just saw the 11% number. That is around the spot where dense data
> is usually faster (blocking, cache coherence, etc.). Is there any chance
> you have a 1% (or so) sparse dataset to experiment with?
>
> > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
> >
> > Hi all,
> >
> > I'm benchmarking Logistic Regression in MLlib using the newly added
> > L-BFGS optimizer and GD. I'm using the same dataset and the same
> > methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
> >
> > I want to know how Spark scales as workers are added, and how the
> > optimizers and input format (sparse or dense) impact performance.
> >
> > The benchmark code can be found here:
> > https://github.com/dbtsai/spark-lbfgs-benchmark
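> >
> > For illustration only (this is not the repo's code, and the parameter
> > values are just assumptions), a minimal sketch of driving both optimizers
> > through MLlib's optimization API looks like this:
> >
> >   import org.apache.spark.mllib.linalg.Vectors
> >   import org.apache.spark.mllib.optimization.{GradientDescent, LBFGS, LogisticGradient, SquaredL2Updater}
> >
> >   // `training` is an RDD[(Double, Vector)] of (label, features); a9a has 123 features.
> >   val initialWeights = Vectors.zeros(123)
> >
> >   // L-BFGS: 10 corrections; the tolerance and iteration cap are illustrative.
> >   val (wLbfgs, lossLbfgs) = LBFGS.runLBFGS(
> >     training, new LogisticGradient(), new SquaredL2Updater(),
> >     10, 1e-4, 100, 0.0, initialWeights)
> >
> >   // Mini-batch GD: step size, iterations, and batch fraction are illustrative.
> >   val (wGd, lossGd) = GradientDescent.runMiniBatchSGD(
> >     training, new LogisticGradient(), new SquaredL2Updater(),
> >     1.0, 100, 0.0, 1.0, initialWeights)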
> >
> > The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated
> > the dataset to make it 762MB with 11M rows. This dataset has 123 features,
> > and 11% of the elements are non-zero.
> >
> > In this benchmark, the entire dataset is cached in memory.
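> >
> > Roughly, the data preparation looks like the sketch below (the exact
> > replication factor here is an assumption; the repo may do it differently):
> >
> >   import org.apache.spark.mllib.util.MLUtils
> >
> >   // Load a9a in LIBSVM format (123 features, ~32K rows, ~2.2MB).
> >   val a9a = MLUtils.loadLibSVMFile(sc, "a9a")
> >
> >   // Replicate each row ~346 times to reach roughly 762MB / 11M rows,
> >   // then keep the whole RDD in memory.
> >   val duplicated = a9a.flatMap(p => Seq.fill(346)(p)).cache()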
> >
> > As expected, L-BFGS converges faster than GD; past a certain point, no
> > matter how hard we push GD, it converges more and more slowly.
> >
> > However, it's surprising that the sparse format runs slower than the dense
> > format. I did see that the sparse format takes a significantly smaller
> > amount of memory when caching the RDD, but sparse is 40% slower than dense.
> > I think sparse should be faster: when we compute x * w^T, since x is
> > sparse, we only need to touch the non-zero entries. I wonder if there is
> > anything I'm doing wrong.
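> >
> > In other words, a dot product against a sparse x only has to walk the
> > stored (index, value) pairs, while the dense version visits every slot.
> > A small sketch (my own illustration, not MLlib's internal kernel):
> >
> >   import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}
> >
> >   // x . w: dense visits all 123 slots; sparse only the ~11% that are non-zero.
> >   def dot(x: Vector, w: Array[Double]): Double = x match {
> >     case d: DenseVector =>
> >       var sum = 0.0; var i = 0
> >       while (i < d.size) { sum += d(i) * w(i); i += 1 }
> >       sum
> >     case sp: SparseVector =>
> >       var sum = 0.0; var k = 0
> >       while (k < sp.indices.length) { sum += sp.values(k) * w(sp.indices(k)); k += 1 }
> >       sum
> >   }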
> >
> > The attachment is the benchmark result.
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
>
