Re: Is there any way to control the parallelism in LogisticRegression

2014-09-06 Thread DB Tsai
Yes, but you need to store the RDD as *serialized* Java objects. See the section on storage levels: http://spark.apache.org/docs/latest/programming-guide.html Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com
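A minimal sketch of what this looks like in the Scala API (the HDFS path and variable names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("ser-cache"))
val points = sc.textFile("hdfs:///data/training")  // hypothetical path
// MEMORY_ONLY_SER keeps each partition as one serialized byte array:
// slower to access, but several times smaller than deserialized objects.
points.persist(StorageLevel.MEMORY_ONLY_SER)
```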

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-04 Thread DB Tsai
To save memory, I recommend compressing the cached RDD; it will be a couple of times smaller than the original data set. Sincerely, DB Tsai
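A sketch of turning on RDD compression; note that `spark.rdd.compress` only takes effect for *serialized* storage levels, so it is paired with MEMORY_ONLY_SER (path is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("compressed-cache")
  .set("spark.rdd.compress", "true")  // compress serialized partitions
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs:///data/training")  // hypothetical path
data.persist(StorageLevel.MEMORY_ONLY_SER)
```

The trade-off is extra CPU on every access to a compressed partition.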

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
Hi Xiangrui, A side question about MLlib. It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only supports L2 regularization; the doc explains it: The L1 regularization by using L1Updater
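For reference, the documented route to L1 in MLlib 1.0/1.1 is to plug an L1Updater into an SGD-based optimizer (L-BFGS assumes a smooth objective, so L1Updater is not wired into it). A sketch, assuming `trainingData: RDD[LabeledPoint]` already exists:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.L1Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.01)
  .setUpdater(new L1Updater)  // swap in L1 instead of the default updater
val model = lr.run(trainingData)
```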

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Xiangrui Meng
+DB, David (they implemented OWLQN on Spark today.)

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread DB Tsai
With David's help today, we were able to implement an elastic-net GLM in Spark. It's surprisingly easy; with just some modification of breeze's OWLQN code, it just works without further investigation. We did a benchmark, and the coefficients are within 0.5% of those from R's glmnet
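A rough sketch of driving breeze's OWLQN directly (the constructor signature varies across breeze versions, so treat this as illustrative). In an elastic-net GLM the L2 term is folded into the smooth objective, while OWLQN handles the L1 penalty:

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

// Toy smooth objective: f(w) = ||w - (1, 2)||^2 / 2, with its gradient.
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]) = {
    val diff = w - DenseVector(1.0, 2.0)
    ((diff dot diff) / 2.0, diff)
  }
}
// OWLQN minimizes f(w) + l1reg * ||w||_1.
val owlqn = new OWLQN[Int, DenseVector[Double]](maxIter = 100, m = 10, l1reg = 0.1)
val wOpt = owlqn.minimize(f, DenseVector.zeros[Double](2))
```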

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
Thanks DB and Xiangrui. Glad to know you guys are actively working on it. Another thing: did we evaluate the loss of using Float to store values? Currently it is Double. Using fewer bits has the benefit of reducing the memory footprint. According to Google, they even use 16 bits (a special encoding
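A back-of-envelope estimate of what the switch would save at this scale (5B values; sizes assumed, JVM object overhead ignored):

```scala
// Memory for 5 billion feature values stored as Double (8 bytes)
// versus Float (4 bytes), ignoring JVM array/object overhead.
val numValues = 5e9
val doubleGB  = numValues * 8 / 1e9  // 40 GB
val floatGB   = numValues * 4 / 1e9  // 20 GB
println(f"Double: $doubleGB%.0f GB, Float: $floatGB%.0f GB")  // halves the footprint
```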

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-21 Thread ZHENG, Xu-dong
Update. I just found a magic parameter *balanceSlack* in *CoalescedRDD*, which sounds like it could control the locality. The default value is 0.1 (a smaller value means lower locality). I changed it to 1.0 (full locality) and used the #3 approach, then found a lot of improvement (20%~40%). Although the Web UI still
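In Spark 1.x, CoalescedRDD is private[spark] and balanceSlack is not exposed through the public coalesce API, so changing it means patching Spark or compiling a helper inside the org.apache.spark.rdd package. A hypothetical sketch under that assumption:

```scala
package org.apache.spark.rdd

import scala.reflect.ClassTag

object LocalityCoalesce {
  // balanceSlack = 1.0 tells the coalescer to favor locality over
  // evenly sized partitions; the Spark default is 0.1.
  def coalesceWithLocality[T: ClassTag](rdd: RDD[T], maxPartitions: Int): RDD[T] =
    new CoalescedRDD(rdd, maxPartitions, balanceSlack = 1.0)
}
```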

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Assuming that your data is very sparse, I would recommend RDD.repartition. But if that is not the case and you don't want to shuffle the data, you can try a CombineInputFormat and then parse the lines into labeled points. Coalesce may cause locality problems if you didn't use the right number of
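A sketch of reading with a combine input format so that many small HDFS blocks land in one split (the path and split cap are illustrative, and `sc` is an existing SparkContext):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// Pack multiple HDFS blocks into each input split, capped at ~1 GB.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 1024L * 1024 * 1024)
val lines = sc.newAPIHadoopFile(
  "hdfs:///data/training",  // hypothetical path
  classOf[CombineTextInputFormat],
  classOf[LongWritable],
  classOf[Text]).map(_._2.toString)
// ...then parse each line into a LabeledPoint.
```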

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread ZHENG, Xu-dong
Hi Xiangrui, Thanks for your reply! Yes, our data is very sparse, but RDD.repartition invokes RDD.coalesce(numPartitions, shuffle = true) internally, so I think it has the same effect as #2, right? As for CombineInputFormat, although I haven't tried it, it sounds like it will combine multiple

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Sorry, I missed #2. My suggestion is the same as #2. You need to set a bigger numPartitions to avoid hitting the integer bound or the 2 GB limitation, at the cost of increased shuffle size per iteration. If you use a CombineInputFormat and then cache, it will try to give you roughly the same size per
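A back-of-envelope way to pick numPartitions so each cached partition stays well under the 2 GB limit (the total cached size here is an assumption):

```scala
// Target ~256 MB per partition for a dataset that caches to ~500 GB.
val totalBytes    = 500L * 1024 * 1024 * 1024  // assumed cached size
val targetBytes   = 256L * 1024 * 1024
val numPartitions = ((totalBytes + targetBytes - 1) / targetBytes).toInt
println(numPartitions)  // 2000
```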

Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
Hi all, We are trying to use Spark MLlib to train on super large data (100M features and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib will create a task for every partition at each iteration. But because our dimensionality is also very high, such a large number of tasks will
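One way to cut the per-iteration task count is to repack the input into fewer partitions before training; a sketch (the partition count and path are illustrative, and `sc` is assumed):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.storage.StorageLevel

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training")  // hypothetical path
// coalesce without shuffle: ~26K input partitions -> 1,000 tasks per iteration.
val repacked = data.coalesce(1000, shuffle = false)
repacked.persist(StorageLevel.MEMORY_ONLY_SER)
val model = LogisticRegressionWithSGD.train(repacked, 100)
```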

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread Jiusheng Chen
How about increasing the HDFS block size? E.g., if the current value is 128M, we make it 512M or bigger.

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
I think this has the same effect and issue as #1, right?