[ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed Mahran updated SPARK-14880:
---------------------------------
    Description: 
The current implementation of (Stochastic) Gradient Descent performs one 
map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
smaller, the algorithm becomes shuffle-bound instead of CPU-bound.

{code}
(1 to numIterations or convergence) {
 rdd
  .sample(fraction)
  .map(Gradient)
  .reduce(Update)
}
{code}
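
For concreteness, here is a hedged Scala sketch of that per-iteration pattern (not the actual MLlib code; pointGradient, the squared-loss gradient, and the simple step-size rule are illustrative placeholders):

{code}
import breeze.linalg.DenseVector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Illustrative squared-loss gradient for a single point: (w.x - y) * x.
def pointGradient(p: LabeledPoint, w: DenseVector[Double]): DenseVector[Double] = {
  val x = DenseVector(p.features.toArray)
  x * ((w dot x) - p.label)
}

// One distributed job (sample -> map -> reduce) per iteration.
def naiveMiniBatchSGD(data: RDD[LabeledPoint],
                      init: DenseVector[Double],
                      numIterations: Int,
                      fraction: Double,
                      stepSize: Double): DenseVector[Double] = {
  var weights = init
  for (i <- 1 to numIterations) {  // or until a convergence test passes
    val (gradSum, count) = data
      .sample(withReplacement = false, fraction, seed = 42 + i)
      .map(p => (pointGradient(p, weights), 1L))
      .reduce { case ((g1, n1), (g2, n2)) => (g1 + g2, n1 + n2) }
    weights = weights - gradSum * (stepSize / math.max(count, 1L))
  }
  weights
}
{code}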

A more performant variation requires only one map-reduce regardless of the number of iterations: a local mini-batch SGD could be run on each partition, and the resulting weight vectors could then be averaged. This is based on Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola, "Parallelized stochastic gradient descent," in Advances in Neural Information Processing Systems, 2010 (http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).

{code}
rdd
 .shuffle()
 .mapPartitions((1 to numIterations or convergence) {
   iter.sample(fraction).map(Gradient).reduce(Update)
 })
 .reduce(Average)
{code}
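
A hedged Scala sketch of this single-pass variant, reusing the illustrative pointGradient above; repartition stands in for the shuffle() step, and each partition is materialized locally so that mini-batches can be resampled from it:

{code}
import scala.util.Random

// Run a local mini-batch SGD on each partition, then average the per-partition models.
def parallelSGDOnePass(data: RDD[LabeledPoint],
                       init: DenseVector[Double],
                       numIterations: Int,
                       fraction: Double,
                       stepSize: Double): DenseVector[Double] = {
  val numPartitions = data.getNumPartitions
  val localModels = data
    .repartition(numPartitions)             // the single shuffle
    .mapPartitions { iter =>
      val points = iter.toArray             // materialize the partition for resampling
      var w = init
      val rng = new Random()
      if (points.nonEmpty) {
        for (_ <- 1 to numIterations) {     // or until a convergence test passes
          val batch = points.filter(_ => rng.nextDouble() < fraction)
          if (batch.nonEmpty) {
            val grad = batch.map(pointGradient(_, w)).reduce(_ + _) * (1.0 / batch.length)
            w = w - grad * stepSize
          }
        }
      }
      Iterator.single(w)
    }
  // Average the per-partition weight vectors (Zinkevich et al., 2010).
  localModels.reduce(_ + _) * (1.0 / numPartitions)
}
{code}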

A higher-level iteration could enclose the above variation, shuffling the data before the local mini-batches and feeding the averaged weights from the previous iteration back in. This allows more variability in the sampling of the mini-batches, with the possibility of covering the whole dataset. Here is a Spark-based implementation: https://github.com/mashin-io/rich-spark/blob/master/main/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala

{code}
(1 to numIterations1 or convergence) {
 rdd
  .shuffle()
  .mapPartitions((1 to numIterations2 or convergence) {
    iter.sample(fraction).map(Gradient).reduce(Update)
  })
  .reduce(Average)
}
{code}
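
Tying the two levels together, a minimal sketch of the enclosing loop (again using the hypothetical parallelSGDOnePass above, not the linked ParallelSGD.scala implementation):

{code}
// Outer loop: reshuffle, run local SGD per partition seeded with the previous
// average, and average again.
def parallelSGD(data: RDD[LabeledPoint],
                init: DenseVector[Double],
                numIterations1: Int,
                numIterations2: Int,
                fraction: Double,
                stepSize: Double): DenseVector[Double] = {
  var weights = init
  for (_ <- 1 to numIterations1) {  // or until a convergence test passes
    weights = parallelSGDOnePass(data, weights, numIterations2, fraction, stepSize)
  }
  weights
}
{code}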

> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---------------------------------------------------------------
>
>                 Key: SPARK-14880
>                 URL: https://issues.apache.org/jira/browse/SPARK-14880
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Ahmed Mahran
>              Labels: performance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
