Hi Makoto,

How many partitions did you set? If there are too many partitions, please do
a coalesce before calling ML algorithms.
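For reference, a rough sketch of that suggestion against the code quoted below; the partition count of 32 (one per executor) and the HDFS path are only illustrative choices, not a recommendation:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Load the libsvm data, then shrink the number of partitions before training,
// so each "aggregate" has fewer partial gradients to combine on the driver.
val data = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false)
val training = data.coalesce(32).cache()  // 32 = number of executors, illustrative only
val model = LogisticRegressionWithSGD.train(training, numIterations = 1)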
Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree

I used treeAggregate in this branch. It should help with the scalability.

Best,
Xiangrui

On Tue, Jun 17, 2014 at 12:22 PM, Makoto Yui <yuin...@gmail.com> wrote:
> Here is a follow-up to the previous evaluation.
>
> "aggregate at GradientDescent.scala:178" never finishes at
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178
>
> We confirmed, via -verbose:gc, that GC is not happening during the aggregate
> and that the cumulative CPU time for the task is increasing little by little.
>
> LBFGS also does not work for a large # of features (news20.random.1000),
> though it works fine for a small # of features (news20.binary.1000).
>
> "aggregate at LBFGS.scala:201" also never finishes at
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L201
>
> -----------------------------------------------------------------------
> [Evaluated code for LBFGS]
>
> import org.apache.spark.SparkContext
> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.mllib.classification.LogisticRegressionModel
> import org.apache.spark.mllib.optimization._
>
> val data = MLUtils.loadLibSVMFile(sc,
>   "hdfs://dm01:8020/dataset/news20-binary/news20.random.1000",
>   multiclass = false)
> val numFeatures = data.take(1)(0).features.size
>
> val training = data.map(x =>
>   (x.label, MLUtils.appendBias(x.features))).cache()
>
> // Run the training algorithm to build the model
> val numCorrections = 10
> val convergenceTol = 1e-4
> val maxNumIterations = 20
> val regParam = 0.1
> val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))
>
> val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
>   training,
>   new LogisticGradient(),
>   new SquaredL2Updater(),
>   numCorrections,
>   convergenceTol,
>   maxNumIterations,
>   regParam,
>   initialWeightsWithIntercept)
> -----------------------------------------------------------------------
>
>
> Thanks,
> Makoto
>
> 2014-06-17 21:32 GMT+09:00 Makoto Yui <yuin...@gmail.com>:
>> Hello,
>>
>> I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
>> Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset even though
>> the number of training examples used in the evaluation is just 1,000.
>>
>> It works fine for the dataset *news20.binary.1000*, which has 178,560
>> features. However, it does not work for *news20.random.1000*, where the
>> number of features is large (1,354,731 features), even though we use sparse
>> vectors through MLUtils.loadLibSVMFile().
>>
>> The execution does not seem to make progress, and no error is reported in
>> the spark-shell or in the stdout/stderr of the executors.
>>
>> We used 32 executors, each with 7GB of working memory (2GB of which is for RDDs).
>>
>> Any suggestions? Your help is really appreciated.
>>
>> ==============
>> Executed code
>> ==============
>> import org.apache.spark.mllib.util.MLUtils
>> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>>
>> // val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/dataset/news20-binary/news20.binary.1000", multiclass = false)
>> val training = MLUtils.loadLibSVMFile(sc,
>>   "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>>   multiclass = false)
>>
>> val numFeatures = training.take(1)(0).features.size
>> // numFeatures: Int = 178560 for news20.binary.1000
>> // numFeatures: Int = 1354731 for news20.random.1000
>> val model = LogisticRegressionWithSGD.train(training, numIterations = 1)
>>
>> ==================================
>> The dataset used in the evaluation
>> ==================================
>>
>> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
>>
>> $ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.binary.1000
>> $ sort -R news20.binary > news20.random
>> $ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.random.1000
>>
>> You can find the datasets at
>> https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
>> https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000
>>
>>
>> Thanks,
>> Makoto
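As background on the treeAggregate suggestion above: a plain aggregate ships every partition's partial result straight to the driver, while treeAggregate first combines partial results in a multi-level tree on the executors, so the driver only merges a handful of them. Below is a minimal toy sketch of the two calls, assuming a Spark release that ships RDD.treeAggregate (it landed in Spark core after this thread); it is only an illustration of the API, not the MLlib code from the linked branch:

// Toy example: summing d-dimensional arrays across 32 partitions in spark-shell.
// In the thread above, d would be ~1,354,731 (one entry per feature); we keep it
// small here so the sketch runs quickly.
val d = 1000
val vectors = sc.parallelize(1 to 100, 32).map(_ => Array.fill(d)(1.0))

val seqOp  = (acc: Array[Double], v: Array[Double]) => {
  var i = 0; while (i < d) { acc(i) += v(i); i += 1 }; acc
}
val combOp = (a: Array[Double], b: Array[Double]) => {
  var i = 0; while (i < d) { a(i) += b(i); i += 1 }; a
}

// Flat aggregate: each of the 32 partitions sends its d-dimensional partial sum
// to the driver, which merges all of them itself.
val flatSum = vectors.aggregate(new Array[Double](d))(seqOp, combOp)

// treeAggregate: partial sums are first combined in intermediate stages on the
// executors (a reduction tree of the given depth), so far fewer results reach
// the driver.
val treeSum = vectors.treeAggregate(new Array[Double](d))(seqOp, combOp, depth = 2)

With many partitions and millions of features, the flat combine concentrates all of the merging (and the corresponding allocations) on the driver; the tree-shaped combine spreads most of that work across the executors, which is why it helps the scalability of the gradient aggregation discussed above.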