Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
, Jiusheng Chen chenjiush...@gmail.com wrote: How about increasing the HDFS file block size? The current value is 128M; we could make it 512M or bigger. On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong dong...@gmail.com wrote: Hi all, We are trying to use Spark MLlib

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Jiusheng Chen
PM, Jiusheng Chen chenjiush...@gmail.com wrote: Hi Xiangrui, A side question about MLlib. It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only supports L2 regularization; the doc explains it: The L1 regularization by using L1Updater http://spark.apache.org/docs/latest/api
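For context on why L1 is the odd one out: MLlib's L1Updater applies a soft-thresholding (proximal) step after each gradient update, which clips small weights to exactly zero; plain L-BFGS assumes a smooth objective, so this non-differentiable L1 step doesn't fit it directly. A minimal pure-Python sketch of the soft-thresholding step (illustrative only, not MLlib's actual implementation; `l1_soft_threshold` is a hypothetical name):

```python
def l1_soft_threshold(weights, step_size, reg_param):
    """Proximal (soft-thresholding) step for L1 regularization:
    w_i <- sign(w_i) * max(0, |w_i| - step_size * reg_param)."""
    shrinkage = step_size * reg_param
    return [
        (abs(w) - shrinkage) * (1.0 if w > 0 else -1.0) if abs(w) > shrinkage else 0.0
        for w in weights
    ]

# Weights whose magnitude is below the shrinkage threshold become exactly
# zero, which is what makes L1-regularized models sparse.
print(l1_soft_threshold([0.5, -0.05, 1.2], step_size=0.1, reg_param=1.0))
```

This is why L1 works with MLlib's SGD-based optimizers (the update is applied per step) but not with the LBFGS path as shipped in those versions.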

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread Jiusheng Chen
How about increasing the HDFS file block size? The current value is 128M; we could make it 512M or bigger. On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong dong...@gmail.com wrote: Hi all, We are trying to use Spark MLlib to train on super large data (100M features and 5B rows). The input data in HDFS
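The reasoning behind the block-size suggestion can be sketched: Spark typically creates one input partition per HDFS block, so a larger block size means fewer, larger partitions and hence fewer tasks per iteration. A small back-of-the-envelope calculation (the 2 TB file size is a made-up example, not from the thread):

```python
import math

def num_input_partitions(file_size_bytes, block_size_bytes):
    """One input split (and hence one Spark partition) per HDFS block."""
    return math.ceil(file_size_bytes / block_size_bytes)

GB = 1024 ** 3
MB = 1024 ** 2
file_size = 2048 * GB  # hypothetical 2 TB input

# 128M blocks -> many small partitions; 512M blocks -> 4x fewer, larger tasks
print(num_input_partitions(file_size, 128 * MB))  # 16384
print(num_input_partitions(file_size, 512 * MB))  # 4096
```

Alternatively, partition count can be controlled on the Spark side (e.g. coalescing after load) without changing HDFS settings.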

LabeledPoint with weight

2014-07-21 Thread Jiusheng Chen
It seems MLlib right now doesn't support weighted training; all training samples have equal importance. Weighted training can be very useful to reduce data size and speed up training. Do you have plans to support it in the future? The data format will be something like: label:weight index1:value1
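To make the proposed record layout concrete, here is a small parser sketch for the `label:weight index1:value1 ...` format. The format itself is only a proposal from the message above, and `parse_weighted_point` is a hypothetical helper, not an MLlib API:

```python
def parse_weighted_point(line):
    """Parse a proposed 'label:weight index1:value1 ...' record into
    (label, weight, {index: value}). The format is hypothetical."""
    head, *feats = line.split()
    label, weight = head.split(":")
    features = {}
    for feat in feats:
        idx, val = feat.split(":")
        features[int(idx)] = float(val)
    return float(label), float(weight), features

# A positive example that counts as 3 identical unweighted samples:
print(parse_weighted_point("1:3.0 4:0.5 7:2.0"))
```

The weight field lets one such record stand in for many duplicate rows, which is the data-size reduction the message describes.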