groupBy seems to be exactly what you want.

val data = sc.parallelize(1 to 200)
data.groupBy(_ % 10).values.map(chunk => chunk.sum)  // sum is a placeholder for your per-chunk processing

This would let you process 10 Iterable[Int] groups in parallel, each
holding 20 ints in this example.

It may not make sense to do this in practice, as you'd be shuffling a
lot of data around just to make the chunks. If you want to map chunks
of data at once, look at mapPartitions(), which will tend to respect
data locality.
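
A minimal sketch of the mapPartitions() approach (sc is assumed to be
your SparkContext; the partition count and the per-chunk sum are just
placeholders for your own chunking and processing):

val chunked = sc.parallelize(1 to 200, numSlices = 10)  // 10 partitions = 10 "chunks"

// mapPartitions hands you an Iterator over a whole partition at once,
// so each chunk is processed where it lives, with no shuffle.
val perChunk = chunked.mapPartitions { chunk =>
  Iterator(chunk.sum)  // placeholder: put your per-chunk processing here
}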

groupBy returns an RDD -- a ShuffledRDD in practice, though the concrete
subclass may depend on what comes before it. That shouldn't matter: it's
still an RDD, and that's what you need, not an Iterable.
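
For reference, in the Scala API the static type spells this out (a small
sketch reusing the data RDD from the snippet at the top):

import org.apache.spark.rdd.RDD

// groupBy hands back an RDD of (key, group) pairs, not a local collection.
val grouped: RDD[(Int, Iterable[Int])] = data.groupBy(_ % 10)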

On Tue, Aug 19, 2014 at 9:02 PM, TJ Klein <tjkl...@gmail.com> wrote:
> Hi,
>
> is there a way to group items in an RDD together so that I can process
> them using parallelize/map?
>
> Let's say I have data items with keys 1...1000, e.g. loaded via
> RDD = sc.newAPIHadoopFile(...).cache()
>
> Now, I would like them to be processed in chunks of, e.g., ten:
> chunk1=[0..9], chunk2=[10..19], ..., chunk100=[990..999]
>
> sc.parallelize([chunk1,....,chunk100]).map(process my chunk)
>
> I thought I could use groupBy() or something like that, but the return
> type is PipelinedRDD, which is not iterable.
> Does anybody have an idea?
> Thanks in advance,
>  Tassilo
>
