Hi Xiangrui,

What's the difference between treeAggregate and aggregate? Why does
treeAggregate scale better? And if we just use mapPartitions, will it be
as fast as treeAggregate?
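My current understanding, as a minimal sketch against the RDD API (the
helper names are made up, the Breeze vectors are placeholders for the
gradient type, and the depth parameter is an assumption based on the tree
branch mentioned below):

-----------------------------------------------------------------------
[Sketch: aggregate vs. treeAggregate]

import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

// Flat aggregate: every partition ships its partial result straight to
// the driver, which merges all of them one by one.
def sumGradients(data: RDD[BDV[Double]], dim: Int): BDV[Double] =
  data.aggregate(BDV.zeros[Double](dim))(
    (acc, v) => acc += v,  // seqOp: fold records within a partition
    (a, b) => a += b)      // combOp: merge partials, all on the driver

// Tree aggregate: partials are first combined in `depth` levels of
// intermediate reduces on the executors, so the driver only merges a
// handful of already-combined vectors instead of one per partition.
def sumGradientsTree(data: RDD[BDV[Double]], dim: Int): BDV[Double] =
  data.treeAggregate(BDV.zeros[Double](dim))(
    (acc, v) => acc += v,
    (a, b) => a += b,
    depth = 2)
-----------------------------------------------------------------------

If I read this right, a plain mapPartitions would still leave all the
per-partition partials to be collected and merged in one place, which is
exactly the step treeAggregate spreads across the executors.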
Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Tue, Jun 17, 2014 at 12:58 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Hi Makoto,
>
> How many partitions did you set? If there are too many partitions,
> please do a coalesce before calling the ML algorithms.
>
> Btw, could you try the tree branch in my repo?
> https://github.com/mengxr/spark/tree/tree
>
> I used tree aggregate in this branch. It should help with the
> scalability.
>
> Best,
> Xiangrui
>
> On Tue, Jun 17, 2014 at 12:22 PM, Makoto Yui <yuin...@gmail.com> wrote:
>> Here is a follow-up to the previous evaluation.
>>
>> "aggregate at GradientDescent.scala:178" never finishes at
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178
>>
>> We confirmed with -verbose:gc that GC is not happening during the
>> aggregate, while the cumulative CPU time for the task keeps increasing
>> little by little.
>>
>> LBFGS also does not work for a large number of features
>> (news20.random.1000), though it works fine for a small number of
>> features (news20.binary.1000).
>>
>> "aggregate at LBFGS.scala:201" also never finishes at
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L201
>>
>> -----------------------------------------------------------------------
>> [Evaluated code for LBFGS]
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.mllib.util.MLUtils
>> import org.apache.spark.mllib.classification.LogisticRegressionModel
>> import org.apache.spark.mllib.optimization._
>>
>> val data = MLUtils.loadLibSVMFile(sc,
>>   "hdfs://dm01:8020/dataset/news20-binary/news20.random.1000",
>>   multiclass = false)
>> val numFeatures = data.take(1)(0).features.size
>>
>> val training = data.map(x =>
>>   (x.label, MLUtils.appendBias(x.features))).cache()
>>
>> // Run the training algorithm to build the model
>> val numCorrections = 10
>> val convergenceTol = 1e-4
>> val maxNumIterations = 20
>> val regParam = 0.1
>> val initialWeightsWithIntercept =
>>   Vectors.dense(new Array[Double](numFeatures + 1))
>>
>> val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
>>   training,
>>   new LogisticGradient(),
>>   new SquaredL2Updater(),
>>   numCorrections,
>>   convergenceTol,
>>   maxNumIterations,
>>   regParam,
>>   initialWeightsWithIntercept)
>> -----------------------------------------------------------------------
>>
>> Thanks,
>> Makoto
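A back-of-the-envelope estimate of the stuck aggregates above (a sketch;
the dense gradient accumulator matches the linked code as far as I can
tell, and the partition count is an assumption):

-----------------------------------------------------------------------
[Sketch: per-iteration driver load of the flat aggregate]

val numFeatures = 1354731L        // news20.random.1000
val bytesPerPartial = (numFeatures + 1) * 8  // one dense Double gradient, ~10.8 MB
val numPartitions = 32L           // assumption: one partition per executor
val driverBytesPerIter = bytesPerPartial * numPartitions
// ~347 MB shipped to and summed serially on the driver per iteration;
// the cost grows linearly with the partition count, which is the merge
// step that tree aggregation pushes back onto the executors.
-----------------------------------------------------------------------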
>> 2014-06-17 21:32 GMT+09:00 Makoto Yui <yuin...@gmail.com>:
>>> Hello,
>>>
>>> I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
>>> Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset even
>>> though the number of training examples used in the evaluation is just
>>> 1,000.
>>>
>>> It works fine for the dataset *news20.binary.1000*, which has 178,560
>>> features. However, it does not work for *news20.random.1000*, where
>>> the number of features is much larger (1,354,731 features), even
>>> though we used sparse vectors through MLUtils.loadLibSVMFile().
>>>
>>> The execution makes no progress, yet no error is reported in the
>>> spark-shell or in the stdout/stderr of the executors.
>>>
>>> We used 32 executors, each allocating 7GB of working memory (2GB of
>>> which is for RDD storage).
>>>
>>> Any suggestions? Your help is really appreciated.
>>>
>>> ==============
>>> Executed code
>>> ==============
>>> import org.apache.spark.mllib.util.MLUtils
>>> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>>>
>>> //val training = MLUtils.loadLibSVMFile(sc,
>>> //  "hdfs://host:8020/dataset/news20-binary/news20.binary.1000",
>>> //  multiclass = false)
>>> val training = MLUtils.loadLibSVMFile(sc,
>>>   "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>>>   multiclass = false)
>>>
>>> val numFeatures = training.take(1)(0).features.size
>>> // numFeatures: Int = 178560 for news20.binary.1000
>>> // numFeatures: Int = 1354731 for news20.random.1000
>>> val model = LogisticRegressionWithSGD.train(training, numIterations = 1)
>>>
>>> ==================================
>>> The dataset used in the evaluation
>>> ==================================
>>>
>>> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
>>>
>>> $ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.binary.1000
>>> $ sort -R news20.binary > news20.random
>>> $ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.random.1000
>>>
>>> You can find the dataset at
>>> https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
>>> https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000
>>>
>>> Thanks,
>>> Makoto
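Tying the coalesce suggestion back to Makoto's executed code, something
like the following sketch may be worth trying first (the target of 32
partitions, one per executor, is an assumption):

-----------------------------------------------------------------------
[Sketch: checking and reducing the partition count before training]

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false)

// Many small HDFS blocks can yield far more partitions than executors.
println(s"partitions = ${training.partitions.length}")

// Coalesce to roughly one partition per executor before training, so
// the flat aggregate has fewer partial gradients to ship to the driver.
val compact = training.coalesce(32).cache() // assumption: 32 executors
val model = LogisticRegressionWithSGD.train(compact, numIterations = 1)
-----------------------------------------------------------------------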