Re: what is the best way to implement mini batches?

2014-12-15 Thread Imran Rashid
I'm a little confused by some of the responses.  It seems like there are
two different issues being discussed here:

1.  How to turn a sequential algorithm into something that works on Spark,
e.g. dealing with the fact that data is split into partitions which are
processed in parallel (though within a partition, data is processed
sequentially).  I'm guessing folks are particularly interested in online
machine learning algorithms, which often have a point update and a mini
batch update.

2.  How to convert a one-point-at-a-time view of the data into a mini batch
view of the data.

(2) is pretty straightforward, e.g. with iterator.grouped(batchSize), or by
manually putting data into your own buffer, etc.  This works for creating
mini batches *within* one partition in the context of Spark.
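
For concreteness, a minimal sketch of (2) in scala -- paste into
spark-shell, where sc is predefined.  The toy data and the per-batch sum,
which stands in for a real model update, are just illustrations:

val data = sc.parallelize(1 to 100, 4)
val batchSize = 10

// grouped(batchSize) yields non-overlapping mini batches *within* each
// partition; the last batch of a partition may be smaller.
val perBatchResults = data.mapPartitions { iter =>
  iter.grouped(batchSize).map(batch => batch.sum)  // stand-in for an update
}
perBatchResults.collect().foreach(println)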

But problem (1) is completely separate, and there is no general solution.
It really depends on the specifics of what you're trying to do.

Some of the suggestions on this thread seem like they are basically just
falling back to sequential data processing ... but really inefficient
sequential processing.  E.g., it doesn't make sense to do a full scan of
your data with Spark and ignore all the records except the few that are in
the next mini batch.

It's completely reasonable to just sequentially process all the data if
that works for you.  But then it doesn't make sense to use Spark; you're
not gaining anything from it.

Hope this helps, apologies if I just misunderstood the other suggested
solutions.
On Dec 14, 2014 8:35 PM, Earthson earthson...@gmail.com wrote:

 I think it could be done like:

 1. use mapPartitions to randomly drop some partitions
 2. drop some elements randomly (within the selected partitions)
 3. calculate the gradient step for the selected elements

 I don't think fixed batching is needed, but it could be done:

 1. zipWithIndex
 2. create a ShuffledRDD based on the index (e.g. using index/10 as the key)
 3. use mapPartitions to calculate each batch

 I also have a question:

 Can mini batches run in parallel?
 I think running all the batches in parallel would just be full-batch GD in
 some cases.







Re: what is the best way to implement mini batches?

2014-12-15 Thread Earthson Lu
Hi Imran, you are right.  Sequential processing doesn't make sense as a way
to use Spark.

I think sequential processing works if the batch for each iteration is
large enough (so that each batch can be processed in parallel).

My point is that we should not run the mini batches in parallel, but it is
still possible to use a large batch and parallelize the work inside each
batch (this seems to be the way SGD is implemented in MLlib?).
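
To make that concrete, a rough spark-shell sketch of the pattern
(iterations stay sequential on the driver, each batch's gradient is
computed in parallel).  The toy data, squared-loss gradient, and step size
are my own illustration, not MLlib's actual code:

import scala.util.Random

// toy data: y = 3x + noise; we fit a single weight w
val data = sc.parallelize(1 to 100000).map { _ =>
  val x = Random.nextDouble()
  (x, 3.0 * x + 0.01 * Random.nextGaussian())
}.cache()

var w = 0.0
val stepSize = 1.0
val miniBatchFraction = 0.1  // one large batch per iteration

for (i <- 1 to 50) {
  // sample one batch, then compute its gradient in parallel across partitions
  val (gradSum, n) = data.sample(false, miniBatchFraction, i)
    .map { case (x, y) => ((w * x - y) * x, 1L) }
    .reduce { case ((g1, c1), (g2, c2)) => (g1 + g2, c1 + c2) }
  w -= stepSize * gradSum / n  // sequential update on the driver
}
println(w)  // should end up close to 3.0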


-- 
Earthson Lu




Re: what is the best way to implement mini batches?

2014-12-14 Thread Earthson
I think it could be done like:

1. use mapPartitions to randomly drop some partitions
2. drop some elements randomly (within the selected partitions)
3. calculate the gradient step for the selected elements

I don't think fixed batching is needed, but it could be done (sketch
below):

1. zipWithIndex
2. create a ShuffledRDD based on the index (e.g. using index/10 as the key)
3. use mapPartitions to calculate each batch
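
The sketch, for spark-shell; groupByKey stands in for building the
ShuffledRDD by hand, and a per-batch sum stands in for the gradient step:

import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

val batchSize = 10
val batches = sc.parallelize(1 to 1000, 8)
  .zipWithIndex()                             // (element, global index)
  .map { case (x, i) => (i / batchSize, x) }  // key = batch id
  .groupByKey()                               // shuffle: one record per batch

// compute something for each batch
batches.mapValues(_.sum).sortByKey().take(5).foreach(println)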

I also have a question:

Can mini batches run in parallel?
I think running all the batches in parallel would just be full-batch GD in
some cases.






Re: what is the best way to implement mini batches?

2014-12-11 Thread ll
any advice/comment on this would be much appreciated.  






Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then call sliding() on
the iterator in each one to get a sliding window. One problem is that you
will not be able to slide forward into the next partition at partition
boundaries. If this matters to you, you need to do something more
complicated to get those, such as the repartitioning you mentioned (where
you map each record to the partition it should be in).
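
A small spark-shell illustration of the within-partition behavior (the
window size of 3 is arbitrary):

val data = sc.parallelize(1 to 20, 2)

// sliding(3) gives overlapping windows within each partition only; windows
// that would span a partition boundary are simply never produced.
val windows = data.mapPartitions(_.sliding(3).map(_.toList))
windows.collect().foreach(println)

(If the boundary windows do matter, I believe MLlib also ships a
cross-boundary sliding() in org.apache.spark.mllib.rdd.RDDFunctions.)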

Matei

 On Dec 11, 2014, at 10:16 AM, ll duy.huynh@gmail.com wrote:
 
 any advice/comment on this would be much appreciated.  
 
 
 



Re: what is the best way to implement mini batches?

2014-12-11 Thread Duy Huynh
the dataset i'm working on has about 100,000 records.  the batch that we're
training on has a size around 10.  can we call repartition(10000) to get
10,000 partitions?

On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 You can just do mapPartitions on the whole RDD, and then call sliding() on
 the iterator in each one to get a sliding window. One problem is that you
 will not be able to slide forward into the next partition at partition
 boundaries. If this matters to you, you need to do something more
 complicated to get those, such as the repartitioning you mentioned (where
 you map each record to the partition it should be in).

 Matei

 




Re: what is the best way to implement mini batches?

2014-12-11 Thread Imran Rashid
Minor correction:  I think you want iterator.grouped(10) for
non-overlapping mini batches.
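
The difference is easy to see in a plain scala REPL (no spark needed):

Iterator(1, 2, 3, 4, 5, 6).grouped(2).toList
// non-overlapping batches: (1,2), (3,4), (5,6)

Iterator(1, 2, 3, 4, 5, 6).sliding(2).toList
// overlapping windows: (1,2), (2,3), (3,4), (4,5), (5,6)
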
On Dec 11, 2014 1:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 You can just do mapPartitions on the whole RDD, and then call sliding() on
 the iterator in each one to get a sliding window. One problem is that you
 will not be able to slide forward into the next partition at partition
 boundaries. If this matters to you, you need to do something more
 complicated to get those, such as the repartitioning you mentioned (where
 you map each record to the partition it should be in).

 Matei





Re: what is the best way to implement mini batches?

2014-12-11 Thread Ilya Ganelin
Hi all. I've been working on a similar problem. One solution that is
straightforward (if suboptimal) is to do the following.

A.zipWithIndex().filter(x => x._2 >= range_start && x._2 < range_end).
Lastly just put that in a for loop.  I've found that this approach scales
very well.
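
A runnable spark-shell version of that loop; the sizes and the per-batch
reduce are my own illustration:

val batchSize = 10L
val A = sc.parallelize(1 to 100).zipWithIndex().cache()  // scanned once per batch

for (rangeStart <- 0L until 100L by batchSize) {
  val rangeEnd = rangeStart + batchSize
  val batch = A.filter { case (_, i) => i >= rangeStart && i < rangeEnd }
  println(batch.map(_._1).reduce(_ + _))  // stand-in for the real per-batch work
}

Note this makes one pass over the (cached) data per mini batch, so it pays
off only when the number of batches is modest.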

As Matei said another option is to define a custom partitioner and then use
mapPartitions. Hope that helps!






Re: what is the best way to implement mini batches?

2014-12-03 Thread Alex Minnaar
I am trying to do the same thing and also wondering what the best strategy is.

Thanks


From: ll duy.huynh@gmail.com
Sent: Wednesday, December 3, 2014 10:28 AM
To: u...@spark.incubator.apache.org
Subject: what is the best way to implement mini batches?

hi.  what is the best way to pass through a large dataset in small,
sequential mini batches?

for example, with 1,000,000 data points and a mini batch size of 10, we
would need to do some computation on each of the mini batches (0..9),
(10..19), (20..29), ..., (N-10..N-1)

would RDD.repartition(N/10).mapPartitions() work?

thanks!


