NP man,

The thing is that since you're in a distributed environment, it'd be
cumbersome to do that. Remember that Spark works basically on
blocks/partitions; they are the unit of distribution and parallelization.
That means that actions are run against them **after having been scheduled
on the cluster**.
The latter point is the most important one: the RDDs aren't "really" created
on the driver, the collection is created/transformed/... on the partitions.
As a consequence, you cannot build such a representation of the distributed
collection on the driver, since the driver hasn't seen the data yet.
So all you can really do on the driver is prepare/define the computations
that will segregate the data by applying a filter on the nodes.
If you want to keep the RDD operators as they are, then yes, you'll need to
pass over the distributed data twice (see the sketch below).
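
To make it concrete, here is a minimal sketch of the two-pass approach (sc,
data and the predicate p are just placeholder names for illustration,
assuming sc is your SparkContext):

  import org.apache.spark.rdd.RDD

  // Placeholder data and predicate, only for illustration.
  val data: RDD[Int] = sc.parallelize(1 to 1000)
  val p: Int => Boolean = _ % 2 == 0

  // Two transformations defined on the driver, executed on the nodes:
  // each one is a separate pass over the distributed data.
  val matching    = data.filter(p)
  val nonMatching = data.filter(x => !p(x))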

Another option, using mapPartitions for instance, would be to create an
RDD[(Seq[A], Seq[A])] (roughly like the sketch below). However, it's going to
be tricky because you might have to repartition, otherwise the OOMs might
blow up in your face :-D.
I wouldn't pick that one!
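
Roughly, reusing the placeholder data and p from the sketch above, it would
look something like this (each partition gets materialized as a Seq in
memory, which is exactly where the OOM risk comes from):

  // One (matching, nonMatching) pair per partition; Seq.partition is the
  // plain Scala collection method I linked in my previous mail.
  val perPartition: RDD[(Seq[Int], Seq[Int])] =
    data.mapPartitions { it =>
      val (yes, no) = it.toSeq.partition(p)
      Iterator((yes, no))
    }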


A final note: looping over the data twice is not that much of a problem
(especially if you can cache it, as in the snippet below), and in fact it's
way better to keep the advantages of resilience etc. that come with Spark.
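
Just as a sketch, with the same placeholder names as above:

  // Cache once, then both filter passes read the in-memory blocks instead
  // of recomputing/rereading the source.
  data.cache()
  val even = data.filter(p)
  val odd  = data.filter(x => !p(x))
  even.count()  // the first action materializes the cache
  odd.count()   // the second pass is served from the cached blocks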

my2c
andy


On Wed Dec 17 2014 at 7:07:05 PM Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com> wrote:

> Hi Andy, thanks for your response. I already thought about filtering
> twice, that was what I meant with "that would be equivalent to applying
> filter twice", but I was wondering if I could do it in a single pass, so
> that it could later be generalized to an arbitrary number of classes. I would
> also like to be able to generate RDDs instead of partitions of a single
> RDD, so I could use RDD methods like stats() on the fragments. But I think
> there is currently no RDD method that returns more than one RDD for a
> single input RDD, so maybe there is some design limitation in Spark that
> prevents this?
>
> Again, thanks for your answer.
>
> Greetings,
>
> Juan
> On 17/12/2014 18:15, "andy petrella" <andy.petre...@gmail.com> wrote:
>
> yo,
>>
>> First, here is the scala version:
>> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)
>>
>> Second: an RDD is distributed, so what you'll have to do is either partition
>> each partition (:-D) or create two RDDs by filtering twice → hence the tasks
>> will be scheduled distinctly, and the data will be read twice. Choose
>> what's best for you!
>>
>> hth,
>> andy
>>
>>
>> On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <
>> juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I would like to be able to split an RDD in two pieces according to a
>>> predicate. That would be equivalent to applying filter twice, with the
>>> predicate and its complement, which is also similar to Haskell's partition
>>> list function (
>>> http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
>>> Is there currently any way to do this in Spark? Or maybe someone has a
>>> suggestion about how to implement this by modifying the Spark source. I think
>>> this is valuable because sometimes I need to split an RDD into several groups
>>> that are too big to fit in the memory of a single thread, so pair RDDs are
>>> not a solution for those cases. A generalization to n parts of Haskell's
>>> partition would do the job.
>>>
>>> Thanks a lot for your help.
>>>
>>> Greetings,
>>>
>>> Juan Rodriguez
>>>
>>
