I think I figured it out. Instead of calling minimize and recording the loss inside the DiffFunction, I should do the following:
val states = lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector)
states.foreach(state => lossHistory.append(state.value))

All the losses in states should be decreasing now. Am I right?

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai <dbt...@stanford.edu> wrote:
> Also, how many rejected steps will it take to terminate the optimization
> process? How is that related to "numberOfImprovementFailures"?
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
>
> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> Hi David,
>>
>> I'm recording the loss history in the DiffFunction implementation, and
>> that's why the rejected steps are also recorded in my loss history.
>>
>> Is there any API in Breeze LBFGS to get a history that already excludes
>> the rejected steps? Or should I just call the "iterations" method and
>> check "iteratingShouldStop" instead?
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>>
>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>> LBFGS will not take a step that sends the objective value up. It might
>>> try a step that is "too big" and reject it, so if you're logging
>>> everything that gets tried by LBFGS, you could see that. The "iterations"
>>> method of the minimizer should never return an increasing objective value.
>>> If you're regularizing, are you including the regularizer in the objective
>>> value computation?
>>>
>>> GD is almost never worth your time.
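For reference, a self-contained sketch of the "iterations" pattern being proposed. The State type below is a hypothetical stand-in for Breeze's optimizer state carrying only the objective value; in the real code, states would come from lbfgs.iterations(new CachedDiffFunction(costFun), init), which yields one state per accepted iteration, so rejected line-search steps never enter the history:

```scala
// Hypothetical stand-in for Breeze's optimizer state (only the field used here).
final case class State(value: Double)

val lossHistory = scala.collection.mutable.ArrayBuffer.empty[Double]

// Stand-in for lbfgs.iterations(...): one State per *accepted* iteration.
val states = Iterator(State(10.0), State(4.0), State(1.5))
states.foreach(state => lossHistory.append(state.value))

// With this pattern the recorded losses are non-increasing.
val losses = lossHistory.toList
val monotone = losses.zip(losses.tail).forall { case (a, b) => b <= a }
```

This matches David's point below: logging per accepted state, rather than per objective evaluation inside the DiffFunction, is what guarantees a decreasing history.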
>>>
>>> -- David
>>>
>>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> Another interesting benchmark.
>>>>
>>>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.*
>>>>
>>>> LBFGS converges in 70 seconds, while GD does not seem to be progressing.
>>>>
>>>> The dense feature vectors would be too big to fit in memory, so I only
>>>> ran the sparse benchmark.
>>>>
>>>> I saw that sometimes the loss bumps up, which is weird to me. Since the
>>>> cost function of logistic regression is convex, the loss should be
>>>> monotonically decreasing. David, any suggestion?
>>>>
>>>> The detailed figure:
>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>>>
>>>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>>>
>>>> LBFGS converges in 25 seconds, while GD again does not seem to be
>>>> progressing.
>>>>
>>>> I only ran the sparse benchmark, for the same reason. I also saw the loss
>>>> bump up, for an unknown reason.
>>>>
>>>> The detailed figure:
>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>>
>>>> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense format
>>>>> will not run due to running out of memory. But the sparse format runs
>>>>> really well.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>>
>>>>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>> I start the timer in runMiniBatchSGD right after
>>>>>> val numExamples = data.count()
>>>>>>
>>>>>> See the following. Running the rcv1 dataset now; will update soon.
>>>>>>
>>>>>> val startTime = System.nanoTime()
>>>>>> for (i <- 1 to numIterations) {
>>>>>>   // Sample a subset (fraction miniBatchFraction) of the total data,
>>>>>>   // then compute and sum up the subgradients on this subset (this is
>>>>>>   // one map-reduce).
>>>>>>   val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>>>>     .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>>>>       seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>>>>>         val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>>>>>         (grad, loss + l)
>>>>>>       },
>>>>>>       combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>>>>>         (grad1 += grad2, loss1 + loss2)
>>>>>>       })
>>>>>>
>>>>>>   /**
>>>>>>    * NOTE(Xinghao): lossSum is computed using the weights from the
>>>>>>    * previous iteration, and regVal is the regularization value computed
>>>>>>    * in the previous iteration as well.
>>>>>>    */
>>>>>>   stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>>>>   val update = updater.compute(
>>>>>>     weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
>>>>>>   weights = update._1
>>>>>>   regVal = update._2
>>>>>>   timeStamp.append(System.nanoTime() - startTime)
>>>>>> }
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> DB Tsai
>>>>>>
>>>>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>> I don't understand why sparse falls behind dense so much at the very
>>>>>>> first iteration. I didn't see count() being called in
>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
>>>>>>> Maybe you have local uncommitted changes.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xiangrui
>>>>>>>
>>>>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> > Hi Xiangrui,
>>>>>>> >
>>>>>>> > Yes, I'm using yarn-cluster mode, and I did check that the number of
>>>>>>> > executors I specified matches the number actually running.
>>>>>>> >
>>>>>>> > For caching and materialization, I start the timer in the optimizer
>>>>>>> > after calling count(), so the time spent materializing the cache isn't
>>>>>>> > included in the benchmark.
>>>>>>> >
>>>>>>> > The difference you saw is actually between the dense and sparse
>>>>>>> > feature vectors. For dense features, you can see that the first
>>>>>>> > iteration of LBFGS and GD takes the same time.
>>>>>>> >
>>>>>>> > I'm going to run rcv1.binary, which only has 0.15% non-zero elements,
>>>>>>> > to verify the hypothesis.
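Reduced to its essentials (Spark parts stubbed out; the timeStamp/startTime names follow the runMiniBatchSGD snippet earlier in the thread), the timing pattern is: materialize the cached data first, start the clock, then record elapsed time at the end of every iteration, so each entry measures optimization work only:

```scala
import scala.collection.mutable.ArrayBuffer

// data.count() would run before this point, so cache-fill time is excluded
// from every timestamp.
val timeStamp = ArrayBuffer.empty[Long]
val numIterations = 3
val startTime = System.nanoTime()
for (i <- 1 to numIterations) {
  // ... one map-reduce pass (sample, aggregate, update weights) goes here ...
  timeStamp.append(System.nanoTime() - startTime)
}
// timeStamp holds cumulative elapsed time per iteration, so it is non-decreasing.
val ts = timeStamp.toList
```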
>>>>>>> >
>>>>>>> > Sincerely,
>>>>>>> >
>>>>>>> > DB Tsai
>>>>>>> >
>>>>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>> >> Hi DB,
>>>>>>> >>
>>>>>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested
>>>>>>> >> yarn-cluster mode and found that YARN does not always give you the
>>>>>>> >> exact number of executors requested. Just want to confirm that you've
>>>>>>> >> checked the number of executors.
>>>>>>> >>
>>>>>>> >> The second thing to check is that in the benchmark code, after you
>>>>>>> >> call cache, you should also call count() to materialize the RDD. I saw
>>>>>>> >> in the result that the real difference is actually at the first step.
>>>>>>> >> Adding an intercept is not a cheap operation for sparse vectors.
>>>>>>> >>
>>>>>>> >> Best,
>>>>>>> >> Xiangrui
>>>>>>> >>
>>>>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>> >> > I don't think it is easy to make sparse faster than dense with this
>>>>>>> >> > sparsity and feature dimension. You can try rcv1.binary, which should
>>>>>>> >> > show the difference easily.
>>>>>>> >> >
>>>>>>> >> > David, the breeze operators used here are
>>>>>>> >> >
>>>>>>> >> > 1. DenseVector dot SparseVector
>>>>>>> >> > 2. axpy DenseVector SparseVector
>>>>>>> >> >
>>>>>>> >> > However, the SparseVector is passed in as Vector[Double] instead of
>>>>>>> >> > SparseVector[Double]. It might use the axpy impl of [DenseVector,
>>>>>>> >> > Vector] and call activeIterator. I didn't check whether you used
>>>>>>> >> > multimethods on axpy.
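Xiangrui's point about the static type can be illustrated with plain Scala overloading. This is a toy model, not Breeze itself (Breeze routes these operations through its UFunc/multimethod machinery), but the hazard is the same: when the value is held as the general Vector type, resolution can miss the sparse-specialized path.

```scala
// Toy vector hierarchy standing in for breeze.linalg types.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(indices: Array[Int], values: Array[Double], size: Int) extends Vec

// Two overloads of axpy: a sparse-specialized O(nnz) path and a generic one.
def axpy(a: Double, x: SparseVec, y: Array[Double]): String = "sparse-specialized"
def axpy(a: Double, x: Vec, y: Array[Double]): String = "generic"

val x: Vec = SparseVec(Array(0), Array(1.0), 3) // static type is Vec, not SparseVec
// Overloads are resolved at compile time from the static type, so this picks
// the generic implementation even though x is sparse at runtime.
val chosen = axpy(2.0, x, new Array[Double](3))
```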
>>>>>>> >> >
>>>>>>> >> > Best,
>>>>>>> >> > Xiangrui
>>>>>>> >> >
>>>>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> >> >> The figure showing the log-likelihood vs. time can be found here:
>>>>>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>>>>> >> >>
>>>>>>> >> >> Let me know if you cannot open it. Thanks.
>>>>>>> >> >>
>>>>>>> >> >> Sincerely,
>>>>>>> >> >>
>>>>>>> >> >> DB Tsai
>>>>>>> >> >>
>>>>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>>>>>> >> >> <shiva...@eecs.berkeley.edu> wrote:
>>>>>>> >> >>> I don't think the attachment came through on the list. Could you
>>>>>>> >> >>> upload the results somewhere and link to them?
>>>>>>> >> >>>
>>>>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>>> >> >>>> 123 features per row, and on average 89% are zeros.
>>>>>>> >> >>>>
>>>>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>>>>>>> >> >>>> > What is the number of non-zeros per row (and the number of
>>>>>>> >> >>>> > features) in the sparse case? We've hit some issues with breeze
>>>>>>> >> >>>> > sparse support in the past, but for sufficiently sparse data it's
>>>>>>> >> >>>> > still pretty good.
>>>>>>> >> >>>> >
>>>>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > Hi all,
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > I'm benchmarking logistic regression in MLlib using the newly
>>>>>>> >> >>>> > > added optimizers LBFGS and GD. I'm using the same dataset and the
>>>>>>> >> >>>> > > same methodology as in this paper:
>>>>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > I want to know how Spark scales as workers are added, and how the
>>>>>>> >> >>>> > > optimizers and input format (sparse or dense) impact performance.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > The benchmark code can be found here:
>>>>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>>>>>>> >> >>>> > > duplicated the dataset to make it 762MB with 11M rows. This
>>>>>>> >> >>>> > > dataset has 123 features, and 11% of the data are non-zero
>>>>>>> >> >>>> > > elements.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > In this benchmark, the entire dataset is cached in memory.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > As expected, LBFGS converges faster than GD, and at some point,
>>>>>>> >> >>>> > > no matter how hard we push GD, it converges more and more slowly.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > However, it's surprising that the sparse format runs slower than
>>>>>>> >> >>>> > > the dense format. I did see that the sparse format takes a
>>>>>>> >> >>>> > > significantly smaller amount of memory when caching the RDD, but
>>>>>>> >> >>>> > > sparse is 40% slower than dense.
>>>>>>> >> >>>> > > I think sparse should be faster: when we compute x w^T, since x
>>>>>>> >> >>>> > > is sparse, we can do it faster. I wonder if there is anything I'm
>>>>>>> >> >>>> > > doing wrong.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > The attachment is the benchmark result.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > Thanks.
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > Sincerely,
>>>>>>> >> >>>> > >
>>>>>>> >> >>>> > > DB Tsai
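The intuition in the thread, that computing x w^T should be cheap when x is sparse, can be sketched in a few lines of plain Scala. This is not the MLlib or Breeze implementation, just an illustration that the dot product costs O(nnz(x)) rather than O(dim) when dispatch actually reaches a path like this one:

```scala
// Minimal sparse vector: parallel arrays of non-zero indices and values.
final case class SparseVector(indices: Array[Int], values: Array[Double])

// Dot product against a dense weight vector, touching only the non-zeros of x.
def dot(x: SparseVector, w: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < x.indices.length) {
    s += x.values(i) * w(x.indices(i))
    i += 1
  }
  s
}

// Only the two non-zero slots are visited: 1.0 * 2.0 + 3.0 * 4.0
val d = dot(SparseVector(Array(0, 2), Array(1.0, 3.0)), Array(2.0, 5.0, 4.0))
```

When the sparse vector is instead handled through a generic path that walks every slot (as discussed earlier in the thread for axpy), this advantage disappears, which is one candidate explanation for sparse running 40% slower than dense.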