Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
So I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(items => { myFunction(items) }) }) and I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(myFunction) }) I tried making myFunction a method, a function val, and even moving it into a singleton

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
The more I'm thinking about this- I may try this instead: val myChunkedRDD: RDD[List[Event]] = inputRDD.mapPartitions(_.grouped(300).toList) I wonder if this would work. I'll try it when I get back to work tomorrow. Yuhao, I tried your approach too but it seems to be somehow moving all the
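For reference, the chunking behavior being discussed comes from Scala's Iterator.grouped, which is what mapPartitions exposes per partition. A minimal plain-Scala sketch (no Spark needed; integers and a batch size of 3 stand in for the thread's Event records and chunk size of 300):

```scala
// Demonstrates Iterator.grouped, the mechanism behind the
// mapPartitions-based chunking discussed in this thread.
// A plain iterator stands in for one RDD partition.
object GroupedDemo extends App {
  val partition: Iterator[Int] = Iterator.range(1, 8) // elements 1..7
  // grouped(3) yields disjoint batches of up to 3 elements each;
  // the last batch may be smaller.
  val chunks: List[List[Int]] = partition.grouped(3).map(_.toList).toList
  println(chunks) // List(List(1, 2, 3), List(4, 5, 6), List(7))
}
```

Note that grouped wraps the iterator lazily, so each batch is only materialized when pulled, which is why it fits naturally inside mapPartitions.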

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
No, only each group should need to fit. On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped =

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize) for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno...@gmail.com wrote: I think the word partition here is a tad different than the term partition that we use in Spark. Basically, I want

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize) for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet

RE: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Yang, Yuhao
Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala It can be used through sliding(windowSize: Int) in spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala Yuhao From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Thursday, February 12, 2015
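One caveat worth noting about Yuhao's pointer: mllib's sliding(windowSize) produces overlapping windows, which is different from the disjoint chunks of Guava's Iterables.partition. The distinction is easy to see with the analogous plain-Scala collection methods:

```scala
// Contrast: sliding windows overlap, grouped chunks do not.
// Plain-Scala analogue of mllib's RDDFunctions.sliding vs. the
// Iterables.partition-style chunking asked about in this thread.
object SlidingVsGrouped extends App {
  val xs = List(1, 2, 3, 4)
  println(xs.sliding(2).toList) // List(List(1, 2), List(2, 3), List(3, 4))
  println(xs.grouped(2).toList) // List(List(1, 2), List(3, 4))
}
```

So sliding fits windowed computations (e.g. moving averages), while grouped inside mapPartitions matches the batching use case discussed above.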