Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread andy petrella
NP man. The thing is that since you're in a distributed environment, it'd be cumbersome to do that. Remember that Spark basically works on blocks/partitions; they are the unit of distribution and parallelization. That means that actions have to be run against them **after having been scheduled on the cluster**. The
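As a minimal illustration of that per-block execution model (the helper name `partitionSizes` is mine, not from the thread), each Spark partition is handled independently by a task scheduled on the cluster:

```scala
import org.apache.spark.rdd.RDD

// Sketch: every task works on exactly one block/partition and
// emits one record describing the partition it was scheduled on.
def partitionSizes[A](rdd: RDD[A]): RDD[(Int, Int)] =
  rdd.mapPartitionsWithIndex { (idx, it) =>
    Iterator((idx, it.size)) // runs on the executor owning this block
  }
```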

Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread Juan Rodríguez Hortalá
Hi Andy, thanks again for your thoughts on this. I haven't found much information about the internals of Spark, so I find these kinds of explanations of its low-level mechanisms very useful and interesting. It's also nice to know that the two-pass approach is a viable solution. Regards, Juan

Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread andy petrella
yo, First, here is the Scala version: http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr) Second: an RDD is distributed, so what you'll have to do is either partition each partition (:-D) or create two RDDs by filtering twice →
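A minimal sketch of the filter-twice idea (the helper name `partitionRDD` is mine, not from the thread); caching the input means the second filter reads from memory instead of recomputing the whole lineage:

```scala
import org.apache.spark.rdd.RDD

// "Create two RDDs by filtering twice": one pass per predicate side.
// cache() keeps the source blocks in memory so the second filter
// does not recompute the upstream lineage.
def partitionRDD[A](rdd: RDD[A])(p: A => Boolean): (RDD[A], RDD[A]) = {
  val cached = rdd.cache()
  (cached.filter(p), cached.filter(x => !p(x)))
}
```

Note that applying Seq.partition inside each block would split every partition locally, but the two halves would still have to be reassembled into two separate RDDs, which is why filtering twice is the simpler route.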

Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread Juan Rodríguez Hortalá
Hi Andy, thanks for your response. I had already thought about filtering twice; that is what I meant by saying it would be equivalent to applying filter twice. But I was wondering whether I could do it in a single pass, so that it could later be generalized to an arbitrary number of classes. I would also like
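One way to sketch that single-pass generalization (the names `multiPartition` and `classify` are hypothetical, not an established Spark API): tag each element with its class exactly once, cache the tagged RDD, and then derive one RDD per class with cheap filters over the cached data:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical sketch: classify() runs once per element (one logical pass
// over the data); each per-class RDD is then a cheap filter over the
// cached tagged RDD rather than a full recomputation.
def multiPartition[A: ClassTag](rdd: RDD[A], numClasses: Int)
                               (classify: A => Int): Seq[RDD[A]] = {
  val tagged = rdd.map(x => (classify(x), x)).cache()
  (0 until numClasses).map { k =>
    tagged.filter { case (c, _) => c == k }.map(_._2)
  }
}

// Usage, splitting integers into negatives, zeros, and positives:
// val Seq(neg, zero, pos) =
//   multiPartition(nums, 3)(x => if (x < 0) 0 else if (x == 0) 1 else 2)
```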