[ 
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257297#comment-15257297
 ] 

Ahmed Mahran commented on SPARK-14880:
--------------------------------------

I think Zinkevich's analysis requires the loss to be convex, regardless of 
whether it is smooth or not. In their evaluation, they used a Huber loss, 
which I think is not smooth.

I'd like to highlight that this algorithm differs from Zinkevich's in two 
ways:
 - It uses mini-batch SGD instead of strict (single-example) SGD
 - It applies higher-level iterations around the parallel pass
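
For illustration only (the Example type and the squared-loss gradient below 
are placeholders, not the actual MLlib classes), the first difference amounts 
to the following:
{code}
// Placeholder example type and squared-loss gradient
case class Example(label: Double, features: Array[Double])

def gradient(w: Array[Double], ex: Example): Array[Double] = {
  val err = w.zip(ex.features).map { case (wi, xi) => wi * xi }.sum - ex.label
  ex.features.map(_ * err)
}

// Strict SGD: one update per single example
def strictStep(w: Array[Double], ex: Example, step: Double): Array[Double] =
  w.zip(gradient(w, ex)).map { case (wi, gi) => wi - step * gi }

// Mini-batch SGD: one update per averaged gradient of a (non-empty) sampled batch
def miniBatchStep(w: Array[Double], batch: Seq[Example], step: Double): Array[Double] = {
  val avg = batch.map(gradient(w, _))
    .reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
    .map(_ / batch.size)
  w.zip(avg).map { case (wi, gi) => wi - step * gi }
}
{code}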

I don't have theoretical evidence about the effect of either modification on 
convergence. However, they seem plausible for the following reasons. In less 
technical terms, the trick is to guarantee that the parallel partitions 
converge to limits that are as close to each other as possible. Imagine a 
group of climbers, one on each partition, climbing similar hills, starting 
from similar points, moving at the same rate, and taking steps in similar 
directions; they would eventually end up at similar spots. The following seem 
to be logically plausible guarantees:
 - Using the same initialization, the same step size per iteration, and the 
same number of iterations on every partition
 - Using mini-batches with the same sampling distribution, which reduces 
stochasticity
 - Averaging and reiterating, which re-synchronizes climbers that may have 
drifted apart back to the same point
 - Reshuffling, which produces new samples to learn from

I would be interested in submitting this as a Spark package. I'd also be 
interested in carrying out experiments; suggestions would be much appreciated.

> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---------------------------------------------------------------
>
>                 Key: SPARK-14880
>                 URL: https://issues.apache.org/jira/browse/SPARK-14880
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Ahmed Mahran
>              Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one 
> map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
> smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
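> A hypothetical, simplified sketch of this pattern (not the actual MLlib 
> code: a squared-loss gradient and a 1/sqrt(i) step-size decay stand in for 
> the Gradient and Updater classes, and the data is (label, features) pairs):
> {code}
> import org.apache.spark.rdd.RDD
>
> // One sample + one distributed aggregate (hence one shuffle) per iteration.
> def currentMiniBatchSGD(data: RDD[(Double, Array[Double])], dim: Int,
>     numIterations: Int, fraction: Double, stepSize: Double): Array[Double] = {
>   var weights = Array.fill(dim)(0.0)
>   for (i <- 1 to numIterations) {
>     val bcWeights = data.context.broadcast(weights)
>     val (gradSum, count) = data
>       .sample(false, fraction, 42L + i)  // fresh mini-batch each iteration
>       .map { case (label, features) =>
>         val err = bcWeights.value.zip(features).map { case (w, x) => w * x }.sum - label
>         (features.map(_ * err), 1L)      // per-example gradient and count
>       }
>       .treeAggregate((Array.fill(dim)(0.0), 0L))(
>         (acc, g) => (acc._1.zip(g._1).map(t => t._1 + t._2), acc._2 + g._2),
>         (a, b) => (a._1.zip(b._1).map(t => t._1 + t._2), a._2 + b._2))
>     if (count > 0) {
>       val step = stepSize / math.sqrt(i)
>       weights = weights.zip(gradSum).map { case (w, g) => w - step * g / count }
>     }
>   }
>   weights
> }
> {code}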
> A more performant variation requires only one map-reduce pass regardless of 
> the number of iterations. A local mini-batch SGD could be run on each 
> partition, and then the results could be averaged. This is based on 
> Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola, 
> "Parallelized stochastic gradient descent," in Advances in Neural 
> Information Processing Systems, 2010, 
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf.
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>    iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
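> A hypothetical sketch of this single-pass variant (sampling with replacement 
> inside each partition, and the same placeholder gradient as above):
> {code}
> import scala.util.Random
> import org.apache.spark.rdd.RDD
>
> // Each partition runs a local mini-batch SGD from the same broadcast starting
> // weights and the same step-size schedule; the per-partition results are averaged.
> def parallelSGDOnce(data: RDD[(Double, Array[Double])], init: Array[Double],
>     numIterations: Int, fraction: Double, stepSize: Double): Array[Double] = {
>   val bcInit = data.context.broadcast(init)
>   val (sum, n) = data
>     .mapPartitionsWithIndex { (pid, iter) =>
>       val examples = iter.toArray                // materialize the partition locally
>       val rng = new Random(pid)                  // per-partition sampling
>       var w = bcInit.value.clone()               // same initialization on every partition
>       val batchSize = math.max(1, (examples.length * fraction).toInt)
>       for (i <- 1 to numIterations if examples.nonEmpty) {
>         val grad = Array.fill(w.length)(0.0)     // mini-batch average gradient
>         for (_ <- 1 to batchSize) {
>           val (label, features) = examples(rng.nextInt(examples.length))
>           val err = w.zip(features).map { case (wi, xi) => wi * xi }.sum - label
>           for (j <- grad.indices) grad(j) += err * features(j) / batchSize
>         }
>         val step = stepSize / math.sqrt(i)       // same schedule on every partition
>         w = w.zip(grad).map { case (wi, gi) => wi - step * gi }
>       }
>       Iterator.single((w, 1L))
>     }
>     .reduce((a, b) => (a._1.zip(b._1).map(t => t._1 + t._2), a._2 + b._2))
>   sum.map(_ / n)                                 // average the per-partition weights
> }
> {code}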
> A higher-level iteration could enclose the above variation, shuffling the 
> data before the local mini-batches and feeding the averaged weights from the 
> previous iteration back in as the new starting point. This allows more 
> variability in the sampling of the mini-batches, with the possibility of 
> covering the whole dataset. Here is a Spark-based implementation: 
> https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
> {code}
> (1 to numIterations1 or convergence) {
>  rdd
>   .shuffle()
>   .mapPartitions((1 to numIterations2 or convergence) {
>     iter.sample(fraction).map(Gradient).reduce(Update)
>   })
>   .reduce(Average)
> }
> {code}
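> And a hypothetical sketch of the enclosing loop, reusing parallelSGDOnce from 
> the previous sketch; repartitioning by a random key stands in for shuffle():
> {code}
> import scala.util.Random
> import org.apache.spark.HashPartitioner
> import org.apache.spark.rdd.RDD
>
> // Outer loop: reshuffle the data, run the local SGD pass starting from the
> // previously averaged weights, and feed the new average back in.
> def parallelSGD(data: RDD[(Double, Array[Double])], dim: Int, outerIterations: Int,
>     innerIterations: Int, fraction: Double, stepSize: Double): Array[Double] = {
>   val numParts = data.partitions.length
>   var weights = Array.fill(dim)(0.0)
>   for (outer <- 1 to outerIterations) {
>     // Reassign examples to random partitions so each round sees a fresh mix
>     val reshuffled = data
>       .mapPartitionsWithIndex { (pid, iter) =>
>         val rng = new Random(outer * numParts + pid)
>         iter.map(ex => (rng.nextInt(numParts), ex))
>       }
>       .partitionBy(new HashPartitioner(numParts))
>       .values
>     weights = parallelSGDOnce(reshuffled, weights, innerIterations, fraction, stepSize)
>   }
>   weights
> }
> {code}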


