Hi Xiangrui,

What's the difference between treeAggregate and aggregate? Why does
treeAggregate scale better? And if we just use mapPartitions, will it be
as fast as treeAggregate?
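My current understanding, as a minimal sketch against the RDD API (the
helper names are made up, the Breeze vectors are placeholders for the
gradient type, and the depth parameter is an assumption based on the tree
branch mentioned below):

-----------------------------------------------------------------------
[Sketch: aggregate vs. treeAggregate]

import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

// Flat aggregate: every partition ships its partial result straight to
// the driver, which merges all of them one by one.
def sumGradients(data: RDD[BDV[Double]], dim: Int): BDV[Double] =
  data.aggregate(BDV.zeros[Double](dim))(
    (acc, v) => acc += v,  // seqOp: fold records within a partition
    (a, b) => a += b)      // combOp: merge partials, all on the driver

// Tree aggregate: partials are first combined in `depth` levels of
// intermediate reduces on the executors, so the driver only merges a
// handful of already-combined vectors instead of one per partition.
def sumGradientsTree(data: RDD[BDV[Double]], dim: Int): BDV[Double] =
  data.treeAggregate(BDV.zeros[Double](dim))(
    (acc, v) => acc += v,
    (a, b) => a += b,
    depth = 2)
-----------------------------------------------------------------------

If I read this right, a plain mapPartitions would still leave all the
per-partition partials to be collected and merged in one place, which is
exactly the step treeAggregate spreads across the executors.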
Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Tue, Jun 17, 2014 at 12:58 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Hi Makoto,
>
> How many partitions did you set? If there are too many partitions,
> please do a coalesce before calling the ML algorithms.
>
> Btw, could you try the tree branch in my repo?
> https://github.com/mengxr/spark/tree/tree
>
> I used tree aggregate in this branch. It should help with the
> scalability.
>
> Best,
> Xiangrui
>
> On Tue, Jun 17, 2014 at 12:22 PM, Makoto Yui <yuin...@gmail.com> wrote:
>> Here is a follow-up to the previous evaluation.
>>
>> "aggregate at GradientDescent.scala:178" never finishes at
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178
>>
>> We confirmed with -verbose:gc that GC is not happening during the
>> aggregate, while the cumulative CPU time for the task keeps increasing
>> little by little.
>>
>> LBFGS also does not work for a large number of features
>> (news20.random.1000), though it works fine for a small number of
>> features (news20.binary.1000).
>>
>> "aggregate at LBFGS.scala:201" also never finishes at
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L201
>>
>> -----------------------------------------------------------------------
>> [Evaluated code for LBFGS]
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.mllib.util.MLUtils
>> import org.apache.spark.mllib.classification.LogisticRegressionModel
>> import org.apache.spark.mllib.optimization._
>>
>> val data = MLUtils.loadLibSVMFile(sc,
>>   "hdfs://dm01:8020/dataset/news20-binary/news20.random.1000",
>>   multiclass = false)
>> val numFeatures = data.take(1)(0).features.size
>>
>> val training = data.map(x =>
>>   (x.label, MLUtils.appendBias(x.features))).cache()
>>
>> // Run the training algorithm to build the model
>> val numCorrections = 10
>> val convergenceTol = 1e-4
>> val maxNumIterations = 20
>> val regParam = 0.1
>> val initialWeightsWithIntercept =
>>   Vectors.dense(new Array[Double](numFeatures + 1))
>>
>> val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
>>   training,
>>   new LogisticGradient(),
>>   new SquaredL2Updater(),
>>   numCorrections,
>>   convergenceTol,
>>   maxNumIterations,
>>   regParam,
>>   initialWeightsWithIntercept)
>> -----------------------------------------------------------------------
>>
>> Thanks,
>> Makoto
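A back-of-the-envelope estimate of the stuck aggregates above (a sketch;
the dense gradient accumulator matches the linked code as far as I can
tell, and the partition count is an assumption):

-----------------------------------------------------------------------
[Sketch: per-iteration driver load of the flat aggregate]

val numFeatures = 1354731L        // news20.random.1000
val bytesPerPartial = (numFeatures + 1) * 8  // one dense Double gradient, ~10.8 MB
val numPartitions = 32L           // assumption: one partition per executor
val driverBytesPerIter = bytesPerPartial * numPartitions
// ~347 MB shipped to and summed serially on the driver per iteration;
// the cost grows linearly with the partition count, which is the merge
// step that tree aggregation pushes back onto the executors.
-----------------------------------------------------------------------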
>> 2014-06-17 21:32 GMT+09:00 Makoto Yui <yuin...@gmail.com>:
>>> Hello,
>>>
>>> I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
>>> Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset even
>>> though the number of training examples used in the evaluation is just
>>> 1,000.
>>>
>>> It works fine for the dataset *news20.binary.1000*, which has 178,560
>>> features. However, it does not work for *news20.random.1000*, where
>>> the number of features is much larger (1,354,731 features), even
>>> though we used sparse vectors through MLUtils.loadLibSVMFile().
>>>
>>> The execution makes no progress, yet no error is reported in the
>>> spark-shell or in the stdout/stderr of the executors.
>>>
>>> We used 32 executors, each allocating 7GB of working memory (2GB of
>>> which is for RDD storage).
>>>
>>> Any suggestions? Your help is really appreciated.
>>>
>>> ==============
>>> Executed code
>>> ==============
>>> import org.apache.spark.mllib.util.MLUtils
>>> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>>>
>>> //val training = MLUtils.loadLibSVMFile(sc,
>>> //  "hdfs://host:8020/dataset/news20-binary/news20.binary.1000",
>>> //  multiclass = false)
>>> val training = MLUtils.loadLibSVMFile(sc,
>>>   "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>>>   multiclass = false)
>>>
>>> val numFeatures = training.take(1)(0).features.size
>>> // numFeatures: Int = 178560 for news20.binary.1000
>>> // numFeatures: Int = 1354731 for news20.random.1000
>>> val model = LogisticRegressionWithSGD.train(training, numIterations = 1)
>>>
>>> ==================================
>>> The dataset used in the evaluation
>>> ==================================
>>>
>>> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
>>>
>>> $ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.binary.1000
>>> $ sort -R news20.binary > news20.random
>>> $ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.random.1000
>>>
>>> You can find the dataset at
>>> https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
>>> https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000
>>>
>>> Thanks,
>>> Makoto
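Tying the coalesce suggestion back to Makoto's executed code, something
like the following sketch may be worth trying first (the target of 32
partitions, one per executor, is an assumption):

-----------------------------------------------------------------------
[Sketch: checking and reducing the partition count before training]

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false)

// Many small HDFS blocks can yield far more partitions than executors.
println(s"partitions = ${training.partitions.length}")

// Coalesce to roughly one partition per executor before training, so
// the flat aggregate has fewer partial gradients to ship to the driver.
val compact = training.coalesce(32).cache() // assumption: 32 executors
val model = LogisticRegressionWithSGD.train(compact, numIterations = 1)
-----------------------------------------------------------------------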