I think I figured it out. Instead of calling minimize and recording the loss inside the DiffFunction, I should do the following:
val states = lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector)
states.foreach(state => lossHistory.append(state.value))

All the losses in states should be decreasing now. Am I right?

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai <dbt...@stanford.edu> wrote:
> Also, how many rejected steps will it take to terminate the optimization
> process? How is that related to "numberOfImprovementFailures"?
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
>
> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> Hi David,
>>
>> I'm recording the loss history in the DiffFunction implementation, and
>> that's why the rejected steps are also recorded in my loss history.
>>
>> Is there any API in Breeze LBFGS to get a history that already excludes
>> the rejected steps? Or should I just call the "iterations" method and
>> check "iteratingShouldStop" instead?
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>>
>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>> LBFGS will not take a step that sends the objective value up. It might
>>> try a step that is "too big" and reject it, so if you're logging
>>> everything that gets tried by LBFGS, you could see that. The "iterations"
>>> method of the minimizer should never return an increasing objective value.
>>> If you're regularizing, are you including the regularizer in the objective
>>> value computation?
>>>
>>> GD is almost never worth your time.
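For reference, a self-contained sketch of the "iterations" pattern being proposed. The State type below is a hypothetical stand-in for Breeze's optimizer state carrying only the objective value; in the real code, states would come from lbfgs.iterations(new CachedDiffFunction(costFun), init), which yields one state per accepted iteration, so rejected line-search steps never enter the history:

```scala
// Hypothetical stand-in for Breeze's optimizer state (only the field used here).
final case class State(value: Double)

val lossHistory = scala.collection.mutable.ArrayBuffer.empty[Double]

// Stand-in for lbfgs.iterations(...): one State per *accepted* iteration.
val states = Iterator(State(10.0), State(4.0), State(1.5))
states.foreach(state => lossHistory.append(state.value))

// With this pattern the recorded losses are non-increasing.
val losses = lossHistory.toList
val monotone = losses.zip(losses.tail).forall { case (a, b) => b <= a }
```

This matches David's point below: logging per accepted state, rather than per objective evaluation inside the DiffFunction, is what guarantees a decreasing history.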
>>>
>>> -- David
>>>
>>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> Another interesting benchmark.
>>>>
>>>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.*
>>>>
>>>> LBFGS converges in 70 seconds, while GD does not seem to be progressing.
>>>>
>>>> The dense feature vectors would be too big to fit in memory, so I only
>>>> ran the sparse benchmark.
>>>>
>>>> I saw that sometimes the loss bumps up, which is weird to me. Since the
>>>> cost function of logistic regression is convex, the loss should be
>>>> monotonically decreasing. David, any suggestion?
>>>>
>>>> The detailed figure:
>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>>>
>>>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>>>
>>>> LBFGS converges in 25 seconds, while GD again does not seem to be
>>>> progressing.
>>>>
>>>> I only ran the sparse benchmark, for the same reason. I also saw the loss
>>>> bump up, for an unknown reason.
>>>>
>>>> The detailed figure:
>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>>
>>>> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense format
>>>>> will not run due to running out of memory. But the sparse format runs
>>>>> really well.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>>
>>>>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>> I start the timer in runMiniBatchSGD right after
>>>>>> val numExamples = data.count()
>>>>>>
>>>>>> See the following. Running the rcv1 dataset now; will update soon.
>>>>>>
>>>>>> val startTime = System.nanoTime()
>>>>>> for (i <- 1 to numIterations) {
>>>>>>   // Sample a subset (fraction miniBatchFraction) of the total data,
>>>>>>   // then compute and sum up the subgradients on this subset (this is
>>>>>>   // one map-reduce).
>>>>>>   val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>>>>     .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>>>>       seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>>>>>         val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>>>>>         (grad, loss + l)
>>>>>>       },
>>>>>>       combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>>>>>         (grad1 += grad2, loss1 + loss2)
>>>>>>       })
>>>>>>
>>>>>>   /**
>>>>>>    * NOTE(Xinghao): lossSum is computed using the weights from the
>>>>>>    * previous iteration, and regVal is the regularization value computed
>>>>>>    * in the previous iteration as well.
>>>>>>    */
>>>>>>   stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>>>>   val update = updater.compute(
>>>>>>     weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
>>>>>>   weights = update._1
>>>>>>   regVal = update._2
>>>>>>   timeStamp.append(System.nanoTime() - startTime)
>>>>>> }
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> DB Tsai
>>>>>>
>>>>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>> I don't understand why sparse falls behind dense so much at the very
>>>>>>> first iteration. I didn't see count() being called in
>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
>>>>>>> Maybe you have local uncommitted changes.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xiangrui
>>>>>>>
>>>>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> > Hi Xiangrui,
>>>>>>> >
>>>>>>> > Yes, I'm using yarn-cluster mode, and I did check that the number of
>>>>>>> > executors I specified matches the number actually running.
>>>>>>> >
>>>>>>> > For caching and materialization, I start the timer in the optimizer
>>>>>>> > after calling count(), so the time spent materializing the cache isn't
>>>>>>> > included in the benchmark.
>>>>>>> >
>>>>>>> > The difference you saw is actually between the dense and sparse
>>>>>>> > feature vectors. For dense features, you can see that the first
>>>>>>> > iteration of LBFGS and GD takes the same time.
>>>>>>> >
>>>>>>> > I'm going to run rcv1.binary, which only has 0.15% non-zero elements,
>>>>>>> > to verify the hypothesis.
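Reduced to its essentials (Spark parts stubbed out; the timeStamp/startTime names follow the runMiniBatchSGD snippet earlier in the thread), the timing pattern is: materialize the cached data first, start the clock, then record elapsed time at the end of every iteration, so each entry measures optimization work only:

```scala
import scala.collection.mutable.ArrayBuffer

// data.count() would run before this point, so cache-fill time is excluded
// from every timestamp.
val timeStamp = ArrayBuffer.empty[Long]
val numIterations = 3
val startTime = System.nanoTime()
for (i <- 1 to numIterations) {
  // ... one map-reduce pass (sample, aggregate, update weights) goes here ...
  timeStamp.append(System.nanoTime() - startTime)
}
// timeStamp holds cumulative elapsed time per iteration, so it is non-decreasing.
val ts = timeStamp.toList
```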
>>>>>>> >
>>>>>>> > Sincerely,
>>>>>>> >
>>>>>>> > DB Tsai
>>>>>>> >
>>>>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>> >> Hi DB,
>>>>>>> >>
>>>>>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested
>>>>>>> >> yarn-cluster mode and found that YARN does not always give you the
>>>>>>> >> exact number of executors requested. Just want to confirm that you've
>>>>>>> >> checked the number of executors.
>>>>>>> >>
>>>>>>> >> The second thing to check is that in the benchmark code, after you
>>>>>>> >> call cache, you should also call count() to materialize the RDD. I saw
>>>>>>> >> in the result that the real difference is actually at the first step.
>>>>>>> >> Adding an intercept is not a cheap operation for sparse vectors.
>>>>>>> >>
>>>>>>> >> Best,
>>>>>>> >> Xiangrui
>>>>>>> >>
>>>>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>> >> > I don't think it is easy to make sparse faster than dense with this
>>>>>>> >> > sparsity and feature dimension. You can try rcv1.binary, which should
>>>>>>> >> > show the difference easily.
>>>>>>> >> >
>>>>>>> >> > David, the breeze operators used here are
>>>>>>> >> >
>>>>>>> >> > 1. DenseVector dot SparseVector
>>>>>>> >> > 2. axpy DenseVector SparseVector
>>>>>>> >> >
>>>>>>> >> > However, the SparseVector is passed in as Vector[Double] instead of
>>>>>>> >> > SparseVector[Double]. It might use the axpy impl of [DenseVector,
>>>>>>> >> > Vector] and call activeIterator. I didn't check whether you used
>>>>>>> >> > multimethods on axpy.
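Xiangrui's point about the static type can be illustrated with plain Scala overloading. This is a toy model, not Breeze itself (Breeze routes these operations through its UFunc/multimethod machinery), but the hazard is the same: when the value is held as the general Vector type, resolution can miss the sparse-specialized path.

```scala
// Toy vector hierarchy standing in for breeze.linalg types.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(indices: Array[Int], values: Array[Double], size: Int) extends Vec

// Two overloads of axpy: a sparse-specialized O(nnz) path and a generic one.
def axpy(a: Double, x: SparseVec, y: Array[Double]): String = "sparse-specialized"
def axpy(a: Double, x: Vec, y: Array[Double]): String = "generic"

val x: Vec = SparseVec(Array(0), Array(1.0), 3) // static type is Vec, not SparseVec
// Overloads are resolved at compile time from the static type, so this picks
// the generic implementation even though x is sparse at runtime.
val chosen = axpy(2.0, x, new Array[Double](3))
```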
>>>>>>> >> >
>>>>>>> >> > Best,
>>>>>>> >> > Xiangrui
>>>>>>> >> >
>>>>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> >> >> The figure showing the log-likelihood vs. time can be found here:
>>>>>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>>>>> >> >>
>>>>>>> >> >> Let me know if you cannot open it. Thanks.
>>>>>>> >> >>
>>>>>>> >> >> Sincerely,
>>>>>>> >> >>
>>>>>>> >> >> DB Tsai
>>>>>>> >> >>
>>>>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>>>>>> >> >> <shiva...@eecs.berkeley.edu> wrote:
>>>>>>> >> >>> I don't think the attachment came through on the list. Could you
>>>>>>> >> >>> upload the results somewhere and link to them?
>>>>>>> >> >>>
>>>>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>>> >> >>>> 123 features per row, and on average 89% are zeros.
>>>>>>> >> >>>>
>>>>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>>>>>>> >> >>>> > What is the number of non-zeros per row (and the number of
>>>>>>> >> >>>> > features) in the sparse case? We've hit some issues with breeze
>>>>>>> >> >>>> > sparse support in the past, but for sufficiently sparse data it's
>>>>>>> >> >>>> > still pretty good.
>>>>>>> >> >>>> >
>>>>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > Hi all,
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > I'm benchmarking logistic regression in MLlib using the newly
>>>>>>> >> >>>> > > added optimizers LBFGS and GD. I'm using the same dataset and the
>>>>>>> >> >>>> > > same methodology as in this paper:
>>>>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > I want to know how Spark scales as workers are added, and how the
>>>>>>> >> >>>> > > optimizers and input format (sparse or dense) impact performance.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > The benchmark code can be found here:
>>>>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>>>>>>> >> >>>> > > duplicated the dataset to make it 762MB with 11M rows. This
>>>>>>> >> >>>> > > dataset has 123 features, and 11% of the data are non-zero
>>>>>>> >> >>>> > > elements.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > In this benchmark, the entire dataset is cached in memory.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > As expected, LBFGS converges faster than GD, and at some point,
>>>>>>> >> >>>> > > no matter how hard we push GD, it converges more and more slowly.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > However, it's surprising that the sparse format runs slower than
>>>>>>> >> >>>> > > the dense format. I did see that the sparse format takes a
>>>>>>> >> >>>> > > significantly smaller amount of memory when caching the RDD, but
>>>>>>> >> >>>> > > sparse is 40% slower than dense.
>>>>>>> >> >>>> > > I think sparse should be faster: when we compute x w^T, since x
>>>>>>> >> >>>> > > is sparse, we can do it faster. I wonder if there is anything I'm
>>>>>>> >> >>>> > > doing wrong.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > The attachment is the benchmark result.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > Thanks.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > Sincerely,
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > DB Tsai
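The intuition in the thread, that computing x w^T should be cheap when x is sparse, can be sketched in a few lines of plain Scala. This is not the MLlib or Breeze implementation, just an illustration that the dot product costs O(nnz(x)) rather than O(dim) when dispatch actually reaches a path like this one:

```scala
// Minimal sparse vector: parallel arrays of non-zero indices and values.
final case class SparseVector(indices: Array[Int], values: Array[Double])

// Dot product against a dense weight vector, touching only the non-zeros of x.
def dot(x: SparseVector, w: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < x.indices.length) {
    s += x.values(i) * w(x.indices(i))
    i += 1
  }
  s
}

// Only the two non-zero slots are visited: 1.0 * 2.0 + 3.0 * 4.0
val d = dot(SparseVector(Array(0, 2), Array(1.0, 3.0)), Array(2.0, 5.0, 4.0))
```

When the sparse vector is instead handled through a generic path that walks every slot (as discussed earlier in the thread for axpy), this advantage disappears, which is one candidate explanation for sparse running 40% slower than dense.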