While it does feel like a filter is what you want, a common way to handle
this is to map each element to a key and group by that key.

Using your rddList example, it looks like this (Scala style):
---
val rddSplit: RDD[(Int, Any)] = rdd.map(x => (createKey(x), x))
val rddBuckets: RDD[(Int, Iterable[Any])] = rddSplit.groupByKey
---

You write *createKey* to do the same work as your filters, and then you have
a single RDD containing your buckets.
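
As a rough, self-contained sketch: the bucket criteria and the *createKey*
logic below are just hypothetical stand-ins for your func1..funcN, and sc is
assumed to be an existing SparkContext.

---
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// sc: an existing SparkContext

// Hypothetical bucketing function standing in for func1, func2, ..., funcN.
// Each element is assigned exactly one bucket index.
def createKey(x: Int): Int =
  if (x < 0) 0            // bucket 0: negatives
  else if (x % 2 == 0) 1  // bucket 1: evens
  else 2                  // bucket 2: everything else

val rdd: RDD[Int] = sc.parallelize(-5 to 20)

// One pass over the data: tag each element with its bucket, then group.
val rddBuckets: RDD[(Int, Iterable[Int])] =
  rdd.map(x => (createKey(x), x)).groupByKey()

// The true/false split from the original question is the same idea with a
// Boolean key.
val trueFalse: RDD[(Boolean, Iterable[Int])] =
  rdd.map(x => (x >= 0, x)).groupByKey()
---

One caveat: groupByKey shuffles every value to its bucket's partition, so if
the per-bucket work can be expressed as an aggregation, reduceByKey or
aggregateByKey will be cheaper.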


On Wed, Jun 10, 2015 at 5:56 AM dgoldenberg <dgoldenberg...@gmail.com>
wrote:

> Hi,
>
> I'm gathering that the typical approach for splitting an RDD is to apply
> several filters to it.
>
> rdd1 = rdd.filter(func1);
> rdd2 = rdd.filter(func2);
> ...
>
> Is there/should there be a way to create 'buckets' like these in one go?
>
> List<RDD> rddList = rdd.filter(func1, func2, ..., funcN)
>
> Another angle here: when applying filter(func), is there a way to get
> two RDDs back, one containing the elements of the original RDD (the one
> being filtered) for which func returned true, and the other containing
> the elements for which it returned false?
>
> Pair<RDD> pair = rdd.filterTrueFalse(func);
>
> Right now I'm doing
>
> RDD x = rdd.filter(func);
> RDD y = rdd.filter(reverseOfFunc);
>
> This seems a bit tautological to me, though Spark must be optimizing this
> out (?)
>
> Thanks.
>
