LBFGS will not take a step that sends the objective value up. It might try
a step that is "too big" and reject it, so if you're logging everything
that LBFGS tries, you could see those rejected trial values. The
"iterations" method of the minimizer should never return an increasing
objective value.
If you're regularizing, are you including the regularizer in the objective
value computation?
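
For reference, here's a minimal, self-contained Breeze sketch (a toy
L2-regularized quadratic standing in for the logistic loss; the object and
parameter names are made up, and this is not the MLlib code) that folds the
regularizer into both the objective value and the gradient, and logs only
the accepted states coming out of "iterations":

    import breeze.linalg.{DenseVector => BDV}
    import breeze.optimize.{DiffFunction, LBFGS}

    object LbfgsObjectiveLogging {
      def main(args: Array[String]): Unit = {
        val regParam = 0.1
        val target = BDV(1.0, -2.0, 3.0)

        val costFun = new DiffFunction[BDV[Double]] {
          override def calculate(w: BDV[Double]): (Double, BDV[Double]) = {
            val diff = w - target
            // Include the regularizer in BOTH the value and the gradient;
            // otherwise the logged objective can appear to go up.
            val loss = 0.5 * (diff dot diff) + 0.5 * regParam * (w dot w)
            val grad = diff + (w * regParam)
            (loss, grad)
          }
        }

        val lbfgs = new LBFGS[BDV[Double]](maxIter = 50, m = 10, tolerance = 1e-9)
        // iterations() yields only accepted states, so state.value should be
        // non-increasing; rejected line-search trials never show up here.
        lbfgs.iterations(costFun, BDV.zeros[Double](3)).foreach { state =>
          println(s"iter=${state.iter} objective=${state.value}")
        }
      }
    }

If you log inside the cost function instead, you will also see the values
from rejected line-search trials, which could explain the apparent bumps.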

GD is almost never worth your time.

-- David

On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbt...@stanford.edu> wrote:

> Another interesting benchmark.
>
> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.*
>
> LBFGS converges in 70 seconds, while GD does not seem to be making progress.
>
> Dense feature vectors would be too big to fit in memory, so I only ran the
> sparse benchmark.
>
> I saw that the loss sometimes bumps up, which seems odd to me. Since the
> cost function of logistic regression is convex, the loss should be
> monotonically decreasing. David, any suggestions?
>
> The detailed figure:
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>
>
> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>
> LBFGS converges in 25 seconds, while GD again does not seem to be making progress.
>
> I only ran the sparse benchmark for the same reason. Here too, I saw the
> loss bump up for unknown reasons.
>
> The detailed figure:
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbt...@stanford.edu> wrote:
>
>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense format
>> will not run because it runs out of memory. But the sparse format runs really well.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>
>>> I'm starting the timer in runMiniBatchSGD right after val numExamples =
>>> data.count()
>>>
>>> See the following. I'm running the rcv1 dataset now and will update soon.
>>>
>>>     val startTime = System.nanoTime()
>>>     for (i <- 1 to numIterations) {
>>>       // Sample a subset (fraction miniBatchFraction) of the total data,
>>>       // then compute and sum up the subgradients on this subset (this is one map-reduce).
>>>       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>         .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>           seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>>             val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>>             (grad, loss + l)
>>>           },
>>>           combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>>             (grad1 += grad2, loss1 + loss2)
>>>           })
>>>
>>>       /**
>>>        * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration,
>>>        * and regVal is the regularization value computed in the previous iteration as well.
>>>        */
>>>       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>       val update = updater.compute(
>>>         weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
>>>       weights = update._1
>>>       regVal = update._2
>>>       timeStamp.append(System.nanoTime() - startTime)
>>>     }
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> I don't understand why sparse falls so far behind dense at the very
>>>> first iteration. I didn't see count() being called in
>>>>
>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
>>>>
>>>> Maybe you have local uncommitted changes.
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> > Hi Xiangrui,
>>>> >
>>>> > Yes, I'm using yarn-cluster mode, and I did check that the number of
>>>> > executors I specified matches the number actually running.
>>>> >
>>>> > For caching and materialization, I start the timer in the optimizer
>>>> > after calling count(); as a result, the time spent materializing the
>>>> > cache isn't included in the benchmark.
>>>> >
>>>> > The difference you saw actually comes from dense versus sparse feature
>>>> > vectors. With dense features, you can see that the first iteration
>>>> > takes the same time for LBFGS as it does for GD.
>>>> >
>>>> > I'm going to run rcv1.binary, which has only 0.15% non-zero elements, to
>>>> > verify this hypothesis.
>>>> >
>>>> >
>>>> > Sincerely,
>>>> >
>>>> > DB Tsai
>>>> > -------------------------------------------------------
>>>> > My Blog: https://www.dbtsai.com
>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >
>>>> >
>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Hi DB,
>>>> >>
>>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested the
>>>> >> yarn-cluster mode and found that YARN does not always give you the
>>>> >> exact number of executors requested. Just want to confirm that you've
>>>> >> checked the number of executors.
>>>> >>
>>>> >> The second thing to check is that in the benchmark code, after you
>>>> >> call cache, you should also call count() to materialize the RDD. I saw
>>>> >> in the results that the real difference is actually at the first step.
>>>> >> Adding an intercept is not a cheap operation for sparse vectors.
>>>> >>
>>>> >> Best,
>>>> >> Xiangrui
>>>> >>
>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com>
>>>> wrote:
>>>> >> > I don't think it is easy to make sparse faster than dense with this
>>>> >> > sparsity and feature dimension. You can try rcv1.binary, which should
>>>> >> > show the difference easily.
>>>> >> >
>>>> >> > David, the breeze operators used here are:
>>>> >> >
>>>> >> > 1. DenseVector dot SparseVector
>>>> >> > 2. axpy DenseVector SparseVector
>>>> >> >
>>>> >> > However, the SparseVector is passed in as Vector[Double] instead of
>>>> >> > SparseVector[Double]. It might use the axpy impl of [DenseVector,
>>>> >> > Vector] and call activeIterator. I didn't check whether you used
>>>> >> > multimethods on axpy.
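>>>> >> >
>>>> >> > For illustration, a minimal plain-Breeze sketch (outside Spark; the
>>>> >> > object name and values are made up) of the two ways the non-zeros can
>>>> >> > be walked: the generic path that only sees Vector[Double] and goes
>>>> >> > through activeIterator, versus using the SparseVector's index/data
>>>> >> > arrays directly:
>>>> >> >
>>>> >> > import breeze.linalg.{DenseVector, SparseVector, Vector}
>>>> >> >
>>>> >> > object SparseDotPaths {
>>>> >> >   def main(args: Array[String]): Unit = {
>>>> >> >     val dv = DenseVector(1.0, 0.0, 3.0, 0.0, 5.0)
>>>> >> >     val sv = SparseVector(5)((0, 2.0), (4, 4.0))
>>>> >> >
>>>> >> >     // Generic path: the argument is only known as Vector[Double], so a
>>>> >> >     // generic impl iterates activeIterator, boxing an (index, value)
>>>> >> >     // pair per non-zero element.
>>>> >> >     val generic: Vector[Double] = sv
>>>> >> >     var dotGeneric = 0.0
>>>> >> >     generic.activeIterator.foreach { case (i, v) => dotGeneric += dv(i) * v }
>>>> >> >
>>>> >> >     // Specialized path: with the static SparseVector type we can walk
>>>> >> >     // the index/data arrays directly, with no per-element allocation.
>>>> >> >     var dotSpecialized = 0.0
>>>> >> >     var k = 0
>>>> >> >     while (k < sv.activeSize) {
>>>> >> >       dotSpecialized += dv(sv.index(k)) * sv.data(k)
>>>> >> >       k += 1
>>>> >> >     }
>>>> >> >
>>>> >> >     println(s"generic=$dotGeneric specialized=$dotSpecialized") // both 22.0
>>>> >> >   }
>>>> >> > }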
>>>> >> >
>>>> >> > Best,
>>>> >> > Xiangrui
>>>> >> >
>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu>
>>>> wrote:
>>>> >> >> The figure showing the Log-Likelihood vs Time can be found here.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>> >> >>
>>>> >> >> Let me know if you cannot open it. Thanks.
>>>> >> >>
>>>> >> >> Sincerely,
>>>> >> >>
>>>> >> >> DB Tsai
>>>> >> >> -------------------------------------------------------
>>>> >> >> My Blog: https://www.dbtsai.com
>>>> >> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> >>
>>>> >> >>
>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>>> >> >> <shiva...@eecs.berkeley.edu> wrote:
>>>> >> >>> I don't think the attachment came through on the list. Could you
>>>> >> >>> upload the results somewhere and link to them?
>>>> >> >>>
>>>> >> >>>
>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com>
>>>> wrote:
>>>> >> >>>>
>>>> >> >>>> 123 features per row, and on average, 89% are zeros.
>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com>
>>>> wrote:
>>>> >> >>>>
>>>> >> >>>> > What is the number of non-zeros per row (and the number of features)
>>>> >> >>>> > in the sparse case? We've hit some issues with breeze sparse support
>>>> >> >>>> > in the past, but for sufficiently sparse data it's still pretty good.
>>>> >> >>>> >
>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu>
>>>> wrote:
>>>> >> >>>> > >
>>>> >> >>>> > > Hi all,
>>>> >> >>>> > >
>>>> >> >>>> > > I'm benchmarking Logistic Regression in MLlib using the newly
>>>> >> >>>> > > added optimizers LBFGS and GD. I'm using the same dataset and the
>>>> >> >>>> > > same methodology as in this paper:
>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>> >> >>>> > >
>>>> >> >>>> > > I want to know how Spark scales as workers are added, and how the
>>>> >> >>>> > > optimizer and the input format (sparse or dense) impact performance.
>>>> >> >>>> > >
>>>> >> >>>> > > The benchmark code can be found here:
>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>> >> >>>> > >
>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>>>> >> >>>> > > duplicated the dataset to make it 762MB with 11M rows. This dataset
>>>> >> >>>> > > has 123 features, and 11% of the entries are non-zero.
>>>> >> >>>> > >
>>>> >> >>>> > > In this benchmark, the entire dataset is cached in memory.
>>>> >> >>>> > >
>>>> >> >>>> > > As we expect, LBFGS converges faster than GD, and past some point,
>>>> >> >>>> > > no matter how hard we push GD, it converges more and more slowly.
>>>> >> >>>> > >
>>>> >> >>>> > > However, it's surprising that the sparse format runs slower than the
>>>> >> >>>> > > dense format. I did see that the sparse format takes a significantly
>>>> >> >>>> > > smaller amount of memory when caching the RDD, but sparse is 40%
>>>> >> >>>> > > slower than dense. I think sparse should be fast: when we compute
>>>> >> >>>> > > x wT, since x is sparse, we can do it faster. I wonder if there is
>>>> >> >>>> > > anything I'm doing wrong.
>>>> >> >>>> > >
>>>> >> >>>> > > The attachment is the benchmark result.
>>>> >> >>>> > >
>>>> >> >>>> > > Thanks.
>>>> >> >>>> > >
>>>> >> >>>> > > Sincerely,
>>>> >> >>>> > >
>>>> >> >>>> > > DB Tsai
>>>> >> >>>> > > -------------------------------------------------------
>>>> >> >>>> > > My Blog: https://www.dbtsai.com
>>>> >> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> >>>> >
>>>> >> >>>
>>>> >> >>>
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>
