What is the number of non-zeros per row (and the number of features) in the sparse 
case? We've hit some issues with Breeze's sparse support in the past, but for 
sufficiently sparse data it's still pretty good. 
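
If it's easier, something like this gives the average non-zeros per row directly (a quick sketch; "data" here stands for your cached RDD[LabeledPoint]):

    import org.apache.spark.SparkContext._  // implicits for RDD[Double].mean()
    import org.apache.spark.mllib.linalg.SparseVector

    // Average number of non-zero entries per row.
    val nnzPerRow = data.map(_.features match {
      case sv: SparseVector => sv.indices.length.toDouble
      case v                => v.toArray.count(_ != 0.0).toDouble
    }).mean()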

> On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
> 
> Hi all,
> 
> I'm benchmarking Logistic Regression in MLlib using the newly added L-BFGS 
> optimizer as well as GD. I'm using the same dataset and the same methodology 
> as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
> 
> I want to know how Spark scales as workers are added, and how the choice of 
> optimizer and input format (sparse or dense) impacts performance. 
> 
> The benchmark code can be found here: 
> https://github.com/dbtsai/spark-lbfgs-benchmark
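> 
> At its core, the L-BFGS run looks roughly like this (a simplified sketch, not 
> the exact code from the repo; the loading helper and the parameter values are 
> illustrative):
> 
>     import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SimpleUpdater}
>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.mllib.util.MLUtils
> 
>     // (label, features) pairs, cached because the optimizer makes many passes.
>     val data = MLUtils.loadLibSVMFile(sc, "a9a")
>       .map(p => (p.label, p.features))
>       .cache()
> 
>     val initialWeights = Vectors.dense(new Array[Double](123))
> 
>     // numCorrections = 10, convergenceTol = 1e-4, maxNumIterations = 100,
>     // regParam = 0.0; the GD runs go through GradientDescent.runMiniBatchSGD instead.
>     val (weights, lossHistory) = LBFGS.runLBFGS(
>       data, new LogisticGradient(), new SimpleUpdater(),
>       10, 1e-4, 100, 0.0, initialWeights)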
> 
> The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated it 
> up to 762MB so that it has 11M rows. The dataset has 123 features, and 11% of 
> the elements are non-zero. 
> 
> In this benchmark, the entire dataset is cached in memory.
> 
> As expected, L-BFGS converges faster than GD, and at some point, no matter 
> how hard we push GD, it just converges more and more slowly. 
> 
> However, it's surprising that the sparse format runs slower than the dense 
> format. I did see that the sparse format takes a significantly smaller amount 
> of memory when caching the RDD, but sparse is 40% slower than dense. I think 
> sparse should be faster: when we compute x^T w, since x is sparse, the dot 
> product only needs to touch the non-zero entries. I wonder if there is 
> anything I'm doing wrong. 
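> 
> To illustrate: a sparse dot product costs O(nnz(x)) rather than O(numFeatures), 
> since it only walks the stored indices. A sketch of the idea (not MLlib's 
> actual kernel):
> 
>     // x is given as parallel arrays of indices and values; w is dense.
>     def sparseDot(indices: Array[Int], values: Array[Double],
>                   w: Array[Double]): Double = {
>       var sum = 0.0
>       var i = 0
>       while (i < indices.length) {
>         sum += values(i) * w(indices(i))
>         i += 1
>       }
>       sum
>     }
> 
> With 123 features at 11% density, that's roughly 14 non-zeros per row, so each 
> dot product should take about 9x fewer multiply-adds than the dense one.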
> 
> The benchmark results are attached.
> 
> Thanks.  
> 
> Sincerely,
> 
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
