Hi Xiangrui,

Thanks for the explanation, but I'm still missing something. In my experiments, if 
miniBatchFraction == 1.0, the algorithm runs in roughly the same time no matter how 
the data is partitioned (2, 4, 8, or 16 partitions), and I have 16 Workers. The 
reduce in runMiniBatchSGD takes most of the time with 2 partitions; 
mapPartitionsWithIndex takes most of it with 16. I would expect the time to drop 
roughly in proportion to the number of partitions, since each partition should be 
processed on a separate Worker. Why doesn't the time go down?

By the way, processing a single instance in my algorithm is a heavy computation, 
which is exactly why I want to parallelize it.
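
For reference, a minimal sketch of the kind of per-partition timing check I have in 
mind (names are illustrative, and iter.size merely stands in for the heavy 
per-instance computation):

import org.apache.spark.rdd.RDD

// Hypothetical helper: report how many elements each partition holds and how long
// it takes to walk through them, to see whether the work is spread across Workers.
def profilePartitions[T](data: RDD[T]): Unit = {
  data.mapPartitionsWithIndex { (idx, iter) =>
    val start = System.nanoTime()
    val count = iter.size                      // stand-in for the real gradient work
    val elapsedMs = (System.nanoTime() - start) / 1e6
    Iterator((idx, count, elapsedMs))
  }.collect().foreach { case (idx, count, ms) =>
    println(s"partition $idx: $count elements, $ms ms")
  }
}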

Best regards, Alexander

On 26.08.2014, at 20:54, "Xiangrui Meng" <men...@gmail.com> wrote:

miniBatchFraction uses RDD.sample to get the mini-batch, and sample
still needs to visit the elements one after another. So it is not
efficient when the task is not computation-heavy, which is why
setMiniBatchFraction is marked as experimental. If we could detect that
the partition iterator is backed by an ArrayBuffer, maybe we could use a
skip iterator that jumps over elements. -Xiangrui
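
To illustrate, a rough sketch of the skip-iterator idea (not actual Spark code: it 
assumes the partition data is already materialized in an IndexedSeq, samples with 
replacement, and uses an approximate sample size, unlike RDD.sample):

import scala.util.Random

// Jump directly to sampled positions instead of scanning every element.
// Only meant to convey the idea of skipping; the real sampler works per element.
def sampleByIndex[T](data: IndexedSeq[T], fraction: Double, seed: Long): Iterator[T] = {
  val rng = new Random(seed)
  val sampleSize = math.max(1, math.round(data.length * fraction).toInt)
  Iterator.fill(sampleSize)(data(rng.nextInt(data.length)))
}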

On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander
<alexander.ula...@hp.com> wrote:
Hi, RJ

https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala

Unit tests are in the same branch.

Alexander

From: RJ Nowling [mailto:rnowl...@gmail.com]
Sent: Tuesday, August 26, 2014 6:59 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Gradient descent and runMiniBatchSGD

Hi Alexander,

Can you post a link to the code?

RJ

On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander
<alexander.ula...@hp.com> wrote:
Hi,

I've implemented the back-propagation algorithm using the Gradient class and a 
simple update using the Updater class, and I run the algorithm with MLlib's 
GradientDescent class. I have trouble scaling out this implementation. I thought 
that if I partitioned my data across the Workers, performance would increase, 
because each Worker would run a gradient-descent step on its own partition of the 
data. But this does not happen, and each Worker seems to process all of the data 
(when miniBatchFraction == 1.0, as in MLlib's logistic regression implementation). 
This doesn't make sense to me, because then a single Worker would give the same 
performance. Could someone elaborate on this and correct me if I am wrong? How can 
I scale out the algorithm with many Workers?
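
To make the question concrete, here is a minimal sketch (illustrative names, not 
MLlib's actual runMiniBatchSGD) of the kind of distributed step I would expect: 
broadcast the current weights, let each partition sum the gradients over its own 
examples, and combine only the partial sums on the driver.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// One distributed gradient-descent step; a least-squares gradient is used as a
// placeholder for the real Gradient class. Assumes features.length == weights.length.
def gradientStep(
    sc: SparkContext,
    data: RDD[(Double, Array[Double])],        // (label, features)
    weights: Array[Double],
    stepSize: Double): Array[Double] = {
  val bcWeights = sc.broadcast(weights)
  val (gradSum, count) = data.mapPartitions { iter =>
    val w = bcWeights.value
    val grad = new Array[Double](w.length)
    var n = 0L
    iter.foreach { case (label, features) =>
      val err = features.zip(w).map { case (x, wi) => x * wi }.sum - label
      var i = 0
      while (i < grad.length) { grad(i) += err * features(i); i += 1 }
      n += 1
    }
    Iterator((grad, n))                        // one partial sum per partition
  }.reduce { case ((g1, n1), (g2, n2)) =>
    (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)
  }
  weights.zip(gradSum).map { case (wi, gi) => wi - stepSize * gi / count }
}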

Best regards, Alexander



--
em rnowl...@gmail.com
c 954.496.2314

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
