Hi Andy,

Thanks again for your thoughts on this. I haven't found much information
about the internals of Spark, so I find these kinds of explanations about
its low-level mechanisms very useful and interesting. It's also nice to know
that the two-pass approach is a viable solution.

Regards,

Juan

2014-12-18 11:10 GMT+01:00 andy petrella <andy.petre...@gmail.com>:
>
> NP man,
>
> The thing is that since you're in a distributed environment, it'd be
> cumbersome to do that. Remember that Spark basically works on
> blocks/partitions; they are the unit of distribution and parallelization.
> That means that actions have to be run against them **after having been
> scheduled on the cluster**.
> The latter point is the most important one: it means that the RDDs aren't
> "really" created on the driver; the collection is created/transformed/... on
> the partitions.
> As a consequence, you cannot build such a representation of the distributed
> collection on the driver, since the driver hasn't seen the data yet.
> That being said, what you can do is prepare/define on the driver some
> computations that will segregate the data by applying a filter on the nodes.
> If you want to keep the RDD operators as they are, then yes, you'll need to
> pass over the distributed data twice.
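>
> (A minimal sketch of that two-pass approach, in Scala. The splitByPredicate
> name is just for illustration, it's not an existing Spark API:)
>
> import org.apache.spark.rdd.RDD
>
> // Define both halves lazily; each half triggers its own pass over the
> // distributed data when an action is eventually run on it.
> def splitByPredicate[A](rdd: RDD[A], p: A => Boolean): (RDD[A], RDD[A]) =
>   (rdd.filter(p), rdd.filter(a => !p(a)))
>
> // Usage sketch, assuming `numbers` is an RDD[Int]:
> // val (evens, odds) = splitByPredicate(numbers, (n: Int) => n % 2 == 0)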
>
> Another option, using mapPartitions for instance, would be to create an
> RDD[(Seq[A], Seq[A])]. However, it's going to be tricky, because you might
> have to repartition, otherwise the OOMs might blow up in your face :-D.
> I wouldn't pick that one!
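>
> (For reference, a hedged sketch of what that mapPartitions variant could
> look like; each partition gets materialized in memory, which is exactly
> where the OOM risk comes from:)
>
> import org.apache.spark.rdd.RDD
>
> // Splits every partition locally into (matching, nonMatching) Seqs.
> // The whole partition is held in memory as a Vector, so large partitions
> // can blow up with OOM.
> def partitionWithin[A](rdd: RDD[A], p: A => Boolean): RDD[(Seq[A], Seq[A])] =
>   rdd.mapPartitions { iter =>
>     val (yes, no) = iter.toVector.partition(p)
>     Iterator((yes, no))
>   }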
>
>
> A final note: looping over the data twice is not that much of a problem
> (especially if you can cache it), and in fact it's way better to keep the
> advantages of resilience, etc., that come with Spark.
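>
> (To illustrate the caching point with a hedged sketch; `sc` is assumed to
> be an existing SparkContext and the numbers are made up:)
>
> // Cache the parent so both filtered RDDs reuse the in-memory blocks
> // instead of re-reading the original source on the second pass.
> val numbers = sc.parallelize(1 to 1000000).cache()
> val evens   = numbers.filter(_ % 2 == 0)
> val odds    = numbers.filter(_ % 2 != 0)
> println(s"evens: ${evens.count()}, odds: ${odds.count()}")
> numbers.unpersist()  // release the cached blocks once both halves are done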
>
> my2c
> andy
>
>
> On Wed Dec 17 2014 at 7:07:05 PM Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com> wrote:
>
>> Hi Andy, thanks for your response. I had already thought about filtering
>> twice; that was what I meant by "that would be equivalent to applying
>> filter twice". But I was wondering whether I could do it in a single pass,
>> so that it could later be generalized to an arbitrary number of classes. I
>> would also like to be able to generate RDDs instead of partitions of a
>> single RDD, so that I could use RDD methods like stats() on the fragments.
>> But I think there is currently no RDD method that returns more than one RDD
>> for a single input RDD, so maybe there is some design limitation in Spark
>> that prevents this?
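>>
>> (Just to make the generalization I have in mind concrete, here is a hedged
>> sketch that stays within the current RDD API; the splitByClass name and
>> labelOf parameter are hypothetical, and it still needs one filter pass per
>> class:)
>>
>> import scala.reflect.ClassTag
>> import org.apache.spark.rdd.RDD
>>
>> // One filtered RDD per class label, so each fragment is a real RDD and
>> // supports methods like stats(). The labels are collected to the driver.
>> def splitByClass[A, K: ClassTag](rdd: RDD[A])(labelOf: A => K): Map[K, RDD[A]] = {
>>   val labels = rdd.map(labelOf).distinct().collect()
>>   labels.map(k => k -> rdd.filter(a => labelOf(a) == k)).toMap
>> }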
>>
>> Again, thanks for your answer.
>>
>> Greetings,
>>
>> Juan
>> On 17/12/2014 18:15, "andy petrella" <andy.petre...@gmail.com> wrote:
>>
>> yo,
>>>
>>> First, here is the scala version:
>>> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)
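>>>
>>> (A quick local illustration of that signature:)
>>>
>>> // Seq.partition: a single local pass producing two collections.
>>> val (evens, odds) = Seq(1, 2, 3, 4, 5).partition(_ % 2 == 0)
>>> // evens == Seq(2, 4), odds == Seq(1, 3, 5)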
>>>
>>> Second: an RDD is distributed, so what you'll have to do is either
>>> partition each partition (:-D) or create two RDDs by filtering twice →
>>> in the latter case the tasks will be scheduled separately, and the data
>>> will be read twice.
>>> Choose what's best for you!
>>>
>>> hth,
>>> andy
>>>
>>>
>>> On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to be able to split an RDD into two pieces according to a
>>>> predicate. That would be equivalent to applying filter twice, with the
>>>> predicate and its complement, which is also similar to Haskell's partition
>>>> list function
>>>> (http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
>>>> Is there currently any way to do this in Spark? Or maybe someone has a
>>>> suggestion about how to implement this by modifying the Spark source. I
>>>> think this is valuable because sometimes I need to split an RDD into
>>>> several groups that are too big to fit in the memory of a single thread,
>>>> so pair RDDs are not a solution for those cases. A generalization of
>>>> Haskell's partition to n parts would do the job.
>>>>
>>>> Thanks a lot for your help.
>>>>
>>>> Greetings,
>>>>
>>>> Juan Rodriguez
>>>>
>>>
