Re: Implementing a Spark version of Haskell's partition
NP man,

The thing is that since you're in a distributed environment, it'd be cumbersome to do that. Remember that Spark works basically on blocks/partitions; they are the unit of distribution and parallelization. That means that operations have to be run against them **after having been scheduled on the cluster**. The latter point is the most important one: the RDDs aren't really created on the driver; the collection is created/transformed/... on the partitions. As a consequence, you cannot create such a representation of the distributed collection on the driver, since you haven't seen the data yet. That being said, all you can do on the driver is prepare/define some computations that will segregate the data by applying a filter on the nodes.

If you want to keep the RDD operators as they are, then yes, you'll need to pass over the distributed data twice. The other option, using mapPartitions for instance, would be to create an RDD[(Seq[A], Seq[A])]; however, that's going to be tricky, because you might have to repartition first, otherwise the OOMs might blow up in your face :-D. I wouldn't pick that one!

A final note: looping over the data is not that much of a problem (especially if you can cache it), and in fact it's way better to keep the resilience etc. that comes with Spark.

my2c
andy
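For concreteness, here is a minimal sketch of that mapPartitions option; `partitionInOnePass` is a hypothetical helper name, not a Spark API, and the sketch assumes an existing RDD. Both Seqs are materialized in a single executor's memory at once, which is exactly the OOM risk Andy warns about:

    import org.apache.spark.rdd.RDD

    // One pass over the data: split each partition locally into
    // (matching, non-matching) Seqs. Both Seqs live in memory at
    // the same time, which is the OOM risk mentioned above.
    def partitionInOnePass[A](rdd: RDD[A], p: A => Boolean): RDD[(Seq[A], Seq[A])] =
      rdd.mapPartitions { it =>
        val (matching, rest) = it.toVector.partition(p)
        Iterator((matching, rest))
      }

Note that the result is a single RDD of pairs of local collections, not two RDDs, so the per-fragment RDD methods Juan asks about below still wouldn't be available on the pieces.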
Re: Implementing a Spark version of Haskell's partition
Hi Andy,

Thanks again for your thoughts on this. I haven't found much information about the internals of Spark, so I find these kinds of explanations about its low-level mechanisms very useful and interesting. It's also nice to know that the two-pass approach is a viable solution.

Regards,

Juan
Implementing a Spark version of Haskell's partition
Hi all,

I would like to be able to split an RDD in two pieces according to a predicate. That would be equivalent to applying filter twice, with the predicate and its complement, which is also similar to Haskell's partition list function (http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).

Is there currently any way to do this in Spark? Or maybe someone has a suggestion about how to implement it by modifying the Spark source. I think this is valuable because sometimes I need to split an RDD into several groups that are too big to fit in the memory of a single thread, so pair RDDs are not a solution for those cases. A generalization of Haskell's partition to n parts would do the job.

Thanks a lot for your help.

Greetings,

Juan Rodriguez
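For reference, this is what partition looks like on a local Scala collection, together with the RDD signature being asked for (the RDD method below does not exist in Spark; it is purely illustrative):

    // Local Scala collections already have this (same idea as Haskell's partition):
    val (evens, odds) = Seq(1, 2, 3, 4, 5).partition(_ % 2 == 0)
    // evens == Seq(2, 4), odds == Seq(1, 3, 5)

    // The hypothetical RDD analogue being requested -- NOT part of Spark's API:
    // def partition(p: A => Boolean): (RDD[A], RDD[A])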
Re: Implementing a Spark version of Haskell's partition
yo,

First, here is the Scala version: http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)

Second: an RDD is distributed, so what you'll have to do is either partition each partition (:-D) or create two RDDs by filtering twice → in the latter case the tasks will be scheduled distinctly, and the data read twice. Choose what's best for you!

hth,
andy
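A minimal sketch of the second option (filtering twice), assuming an existing SparkContext `sc`; `split` is an illustrative helper name, not a Spark API. Caching the input avoids re-reading the source data for the second filter:

    import org.apache.spark.rdd.RDD

    // Two passes, one per output RDD; each filter is scheduled as its own tasks.
    def split[A](rdd: RDD[A], p: A => Boolean): (RDD[A], RDD[A]) = {
      rdd.cache()  // avoid recomputing/re-reading the input for the second filter
      (rdd.filter(p), rdd.filter(a => !p(a)))
    }

    val numbers = sc.parallelize(1 to 10)
    val (evens, odds) = split(numbers, (_: Int) % 2 == 0)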
Re: Implementing a Spark version of Haskell's partition
Hi Andy,

Thanks for your response. I had already thought about filtering twice; that is what I meant by "that would be equivalent to applying filter twice". But I was wondering whether I could do it in a single pass, so that it could later be generalized to an arbitrary number of classes. I would also like to be able to generate RDDs instead of partitions of a single RDD, so I could use RDD methods like stats() on the fragments. But I think there is currently no RDD method that returns more than one RDD for a single input RDD, so maybe there is some design limitation in Spark that prevents this?

Again, thanks for your answer.

Greetings,

Juan
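To make the generalization concrete: a sketch of an n-way split built only from existing RDD operators, so it still runs one filtering pass per output RDD rather than the single pass asked about. `splitBy` and the classifier function are illustrative names, and an existing SparkContext `sc` is assumed. Each piece is a full RDD, so methods like stats() (available on numeric RDDs) work on it:

    import org.apache.spark.SparkContext._  // RDD[Double] implicits such as stats() in Spark 1.x
    import org.apache.spark.rdd.RDD

    // Split an RDD into n RDDs, one per class index in [0, n).
    // Still one filter (hence one pass) per class; caching the input
    // keeps it from being recomputed n times.
    def splitBy[A](rdd: RDD[A], n: Int)(classify: A => Int): Seq[RDD[A]] = {
      rdd.cache()
      (0 until n).map(i => rdd.filter(a => classify(a) == i))
    }

    // e.g. residue classes mod 3, then per-fragment statistics:
    val pieces = splitBy(sc.parallelize(1 to 100).map(_.toDouble), 3)(x => (x % 3).toInt)
    pieces.foreach(p => println(p.stats()))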