Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread andy petrella
NP man,

The thing is that since you're in a distributed environment, it'd be
cumbersome to do that. Remember that Spark basically works on
blocks/partitions: they are the unit of distribution and parallelization.
That means that actions have to be run against them **after having been
scheduled on the cluster**.
The latter point is the most important one: the RDD isn't really
materialized on the driver; the collection is created/transformed/... on
the partitions.
As a consequence, you cannot build such a representation of the distributed
collection on the driver, since the driver never sees the data itself.
That being said, all you can do on the driver is prepare/define the
computations that will segregate the data by applying a filter on the nodes.
If you want to keep the RDD operators as they are, then yes, you'll need to
pass over the distributed data twice.
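
For what it's worth, here's a minimal sketch of that two-pass idea (the
splitBy helper is just an illustration I'm making up, not an existing Spark
method):

    import org.apache.spark.rdd.RDD

    // Hypothetical helper (not part of the Spark API): split an RDD in two
    // by filtering twice, with the predicate and its complement.
    // Caching the input avoids recomputing it for the second pass.
    def splitBy[A](rdd: RDD[A])(p: A => Boolean): (RDD[A], RDD[A]) = {
      val cached = rdd.cache()
      (cached.filter(p), cached.filter(a => !p(a)))
    }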

Another option, using mapPartitions for instance, would be to create an
RDD[(Seq[A], Seq[A])]; however it's going to be tricky, because you might
have to repartition, otherwise OOMs might blow up in your face :-D.
I wouldn't pick that one!
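
Just to make that concrete, here is a sketch of what that single-pass
variant could look like (again my own illustration, with the caveat that
both Seqs of each partition must fit in one executor's memory):

    import org.apache.spark.rdd.RDD

    // Single-pass variant (illustrative only): split each partition locally
    // with Scala's Seq.partition, yielding one (matches, rest) pair per
    // partition. Both lists are fully materialized in memory, which is
    // where the OOM risk comes from.
    def splitPerPartition[A](rdd: RDD[A])(p: A => Boolean): RDD[(Seq[A], Seq[A])] =
      rdd.mapPartitions { it =>
        val (yes, no) = it.toList.partition(p)
        Iterator((yes, no))
      }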


A final note: going over the data twice is not that much of a problem
(especially if you can cache it), and in fact it's way better to keep the
advantages of resilience etc. that come with Spark.

my2c
andy




Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread Juan Rodríguez Hortalá
Hi Andy,

Thanks again for your thoughts on this. I haven't found much information
about the internals of Spark, so I find these kinds of explanations about
its low-level mechanisms very useful and interesting. It's also nice to
know that the two-pass approach is a viable solution.

Regards,

Juan





Implementing a spark version of Haskell's partition

2014-12-17 Thread Juan Rodríguez Hortalá
Hi all,

I would like to be able to split an RDD in two pieces according to a
predicate. That would be equivalent to applying filter twice, with the
predicate and its complement, which is also similar to Haskell's partition
list function (
http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html). Is
there currently any way to do this in Spark? Or maybe someone has a
suggestion about how to implement this by modifying the Spark source. I
think this is valuable because sometimes I need to split an RDD into
several groups that are too big to fit in the memory of a single thread, so
pair RDDs are not a solution for those cases. A generalization to n parts
of Haskell's partition would do the job.
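
For example, the kind of API I have in mind would look something like this
(just a sketch of the desired signatures, not existing Spark code):

    import org.apache.spark.rdd.RDD

    // Desired API, sketched: split one RDD into two with a predicate, or
    // more generally into n groups chosen by a classifier function.
    def partitionRDD[A](rdd: RDD[A])(p: A => Boolean): (RDD[A], RDD[A]) =
      (rdd.filter(p), rdd.filter(a => !p(a)))

    def splitInto[A](rdd: RDD[A], n: Int)(classify: A => Int): Seq[RDD[A]] =
      (0 until n).map(i => rdd.filter(a => classify(a) == i))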

Thanks a lot for your help.

Greetings,

Juan Rodriguez


Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread andy petrella
yo,

First, here is the Scala version:
http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=Boolean):(Repr,Repr)
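
For reference, on a plain local collection it does this:

    // Scala's Seq.partition: one pass, returns the matches and the rest.
    val (evens, odds) = Seq(1, 2, 3, 4, 5).partition(_ % 2 == 0)
    // evens == List(2, 4), odds == List(1, 3, 5)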

Second: an RDD is distributed, so what you'll have to do is either
partition each partition (:-D) or create two RDDs by filtering twice → in
the latter case the tasks will be scheduled separately and the data will be
read twice. Choose what's best for you!

hth,
andy





Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread Juan Rodríguez Hortalá
Hi Andy, thanks for your response. I had already thought about filtering
twice; that was what I meant by "that would be equivalent to applying
filter twice", but I was wondering whether I could do it in a single pass,
so that it could later be generalized to an arbitrary number of classes. I
would also like to be able to generate RDDs instead of partitions of a
single RDD, so I could use RDD methods like stats() on the fragments. But I
think there is currently no RDD method that returns more than one RDD for a
single input RDD, so maybe there is some design limitation in Spark that
prevents this?
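
For instance, what I'd like to be able to write is something along these
lines (just an illustration, assuming the two-filter split and an RDD of
Doubles so that stats() is available):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Illustration only: split a cached RDD[Double] with a predicate and
    // its complement, then compute per-fragment statistics with stats().
    def fragmentStats(sc: SparkContext): Unit = {
      val data: RDD[Double] = sc.parallelize((1 to 100).map(_.toDouble)).cache()
      val (small, large) = (data.filter(_ < 50.0), data.filter(_ >= 50.0))
      println(small.stats())  // StatCounter of the first fragment
      println(large.stats())  // StatCounter of the second fragment
    }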

Again, thanks for your answer.

Greetings,

Juan