Hi Makoto,

How many partitions did you set? If there are too many partitions,
please coalesce the training RDD to fewer partitions before calling the
ML algorithms.
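
For example, a rough sketch (the target of 32 partitions is only an
illustration, roughly one partition per executor in your setup):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.util.MLUtils

  val data = MLUtils.loadLibSVMFile(sc,
    "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
    multiclass = false)
  // merge existing partitions (no shuffle) before training
  val training = data.coalesce(32).cache()
  val model = LogisticRegressionWithSGD.train(training, numIterations = 1)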

Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree

I used tree aggregation in this branch. It should help with scalability.
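
Roughly, the idea is to merge per-partition results level by level in a
tree instead of pulling every partition's result straight to the driver.
A toy sketch of the idea, assuming the treeAggregate method on RDD (the
summation below is only for illustration, not the actual gradient step):

  // sum one million doubles spread over 200 partitions;
  // partial sums are combined through a 2-level tree of intermediate tasks
  val values = sc.parallelize(1 to 1000000, 200).map(_.toDouble)
  val total = values.treeAggregate(0.0)(
    (acc, v) => acc + v,  // fold values within one partition
    (a, b) => a + b,      // merge partial sums, level by level
    depth = 2)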

Best,
Xiangrui

On Tue, Jun 17, 2014 at 12:22 PM, Makoto Yui <yuin...@gmail.com> wrote:
> Here is a follow-up to the previous evaluation.
>
> "aggregate at GradientDescent.scala:178" never finishes at
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178
>
> We confirmed, via -verbose:gc, that GC is not happening during the
> aggregate and that the cumulative CPU time for the task increases only
> little by little.
>
> LBFGS also does not work for a large number of features
> (news20.random.1000), though it works fine for a small number of features
> (news20.binary.1000).
>
> "aggregate at LBFGS.scala:201" also never finishes at
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L201
>
> -----------------------------------------------------------------------
> [Evaluated code for LBFGS]
>
> import org.apache.spark.SparkContext
> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.mllib.classification.LogisticRegressionModel
> import org.apache.spark.mllib.optimization._
>
> val data = MLUtils.loadLibSVMFile(sc,
>   "hdfs://dm01:8020/dataset/news20-binary/news20.random.1000",
>   multiclass = false)
> val numFeatures = data.take(1)(0).features.size
>
> // append a bias (intercept) term to each feature vector
> val training = data.map(x => (x.label,
>   MLUtils.appendBias(x.features))).cache()
>
> // Run training algorithm to build the model
> val numCorrections = 10
> val convergenceTol = 1e-4
> val maxNumIterations = 20
> val regParam = 0.1
> val initialWeightsWithIntercept =
>   Vectors.dense(new Array[Double](numFeatures + 1))
>
> val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
>   training,
>   new LogisticGradient(),
>   new SquaredL2Updater(),
>   numCorrections,
>   convergenceTol,
>   maxNumIterations,
>   regParam,
>   initialWeightsWithIntercept)
> -----------------------------------------------------------------------
>
>
> Thanks,
> Makoto
>
> 2014-06-17 21:32 GMT+09:00 Makoto Yui <yuin...@gmail.com>:
>> Hello,
>>
>> I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
>> Hadoop 0.20.2-cdh3u6, but it does not work on a sparse dataset even
>> though the number of training examples used in the evaluation is just 1,000.
>>
>> It works fine for the dataset *news20.binary.1000*, which has 178,560
>> features. However, it does not work for *news20.random.1000*, where the
>> number of features is much larger (1,354,731), even though the features
>> are loaded as sparse vectors via MLUtils.loadLibSVMFile().
>>
>> The execution does not seem to make progress, and no error is reported
>> in the spark-shell or in the stdout/stderr of the executors.
>>
>> We used 32 executors, each with 7GB of working memory (2GB of which is
>> for RDD storage).
>>
>> Any suggestions? Your help is really appreciated.
>>
>> ==============
>> Executed code
>> ==============
>> import org.apache.spark.mllib.util.MLUtils
>> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>>
>> // val training = MLUtils.loadLibSVMFile(sc,
>> //   "hdfs://host:8020/dataset/news20-binary/news20.binary.1000",
>> //   multiclass = false)
>> val training = MLUtils.loadLibSVMFile(sc,
>>   "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>>   multiclass = false)
>>
>> val numFeatures = training.take(1)(0).features.size
>> //numFeatures: Int = 178560 for news20.binary.1000
>> //numFeatures: Int = 1354731 for news20.random.1000
>> val model = LogisticRegressionWithSGD.train(training, numIterations=1)
>>
>> ==================================
>> The dataset used in the evaluation
>> ==================================
>>
>> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
>>
>> $ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.binary.1000
>> $ sort -R news20.binary > news20.random
>> $ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.random.1000
>>
>> You can find the datasets at
>> https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
>> https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000
>>
>>
>> Thanks,
>> Makoto
