Hi Andy, Thanks again for your thoughts on this. I haven't found much information about the internals of Spark, so I find these kinds of explanations of its low-level mechanisms very useful and interesting. It's also good to know that the two-pass approach is a viable solution.
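Just to check that I understood it correctly, here is a minimal sketch of how I picture the two-pass approach (the name split is mine, not anything in the Spark API; caching the input so both filters can reuse it):

    import org.apache.spark.rdd.RDD

    // Split an RDD in two by filtering twice: once with the predicate and
    // once with its complement. Caching the input avoids re-reading or
    // recomputing it for the second pass.
    def split[A](rdd: RDD[A])(p: A => Boolean): (RDD[A], RDD[A]) = {
      rdd.cache()
      (rdd.filter(p), rdd.filter(a => !p(a)))
    }

    // e.g. val (evens, odds) = split(sc.parallelize(1 to 100))(_ % 2 == 0)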
Regards,

Juan

2014-12-18 11:10 GMT+01:00 andy petrella <andy.petre...@gmail.com>:
>
> NP man,
>
> The thing is that since you're in a dist env, it'd be cumbersome to do
> that. Remember that Spark basically works on blocks/partitions; they are
> the unit of distribution and parallelization.
> That means that actions have to be run against them **after having been
> scheduled on the cluster**.
> The latter point is the most important: it means that the RDDs aren't
> "really" created on the driver; the collection is created/transformed/...
> on the partitions.
> As a consequence, you cannot, on the driver, create such a representation
> of the distributed collection, since you haven't seen it yet.
> That being said, you can only prepare/define some computations on the
> driver that will segregate the data by applying a filter on the nodes.
> If you want to keep the RDD operators as they are, then yes, you'll need
> to pass over the distributed data twice.
>
> The option of using mapPartitions, for instance, would be to create an
> RDD[(Seq[A], Seq[A])]; however, it's going to be tricky because you might
> have to repartition, otherwise the OOMs might blow up in your face :-D.
> I wouldn't pick that one!
>
>
> A final note: looping over the data is not that much of a problem
> (especially if you can cache it), and in fact it's way better to keep the
> advantages of resilience etc. that come with Spark.
>
> my2c
> andy
>
>
> On Wed Dec 17 2014 at 7:07:05 PM Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com> wrote:
>
>> Hi Andy, thanks for your response. I had already thought about filtering
>> twice, that was what I meant by "that would be equivalent to applying
>> filter twice", but I was wondering whether I could do it in a single pass,
>> so that it could later be generalized to an arbitrary number of classes.
>> I would also like to be able to generate RDDs instead of partitions of a
>> single RDD, so I could use RDD methods like stats() on the fragments. But
>> I think there is currently no RDD method that returns more than one RDD
>> for a single input RDD, so maybe there is some design limitation in Spark
>> that prevents this?
>>
>> Again, thanks for your answer.
>>
>> Greetings,
>>
>> Juan
>> On 17/12/2014 18:15, "andy petrella" <andy.petre...@gmail.com> wrote:
>>
>>> yo,
>>>
>>> First, here is the Scala version:
>>> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)
>>>
>>> Second: an RDD is distributed, so what you'll have to do is either
>>> partition each partition (:-D) or create two RDDs by filtering twice →
>>> hence tasks will be scheduled separately, and the data read twice.
>>> Choose what's best for you!
>>>
>>> hth,
>>> andy
>>>
>>>
>>> On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to be able to split an RDD in two pieces according to a
>>>> predicate. That would be equivalent to applying filter twice, with the
>>>> predicate and its complement, which is also similar to Haskell's
>>>> partition list function (
>>>> http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
>>>> Is there currently any way to do this in Spark? Or maybe someone has a
>>>> suggestion about how to implement this by modifying the Spark source. I
>>>> think this is valuable because sometimes I need to split an RDD into
>>>> several groups that are too big to fit in the memory of a single thread,
>>>> so pair RDDs are not a solution for those cases. A generalization to n
>>>> parts of Haskell's partition would do the job.
>>>>
>>>> Thanks a lot for your help.
>>>>
>>>> Greetings,
>>>>
>>>> Juan Rodriguez
>>>>
>>>
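P.S. For the record, this is roughly how I understand the mapPartitions variant you described (just a sketch, splitInOnePass is a made-up name); I can see why it risks OOMs, since it materializes each partition as in-memory Seqs:

    import org.apache.spark.rdd.RDD

    // Single pass, but every Spark partition is collected into two Seqs in
    // memory, so large or skewed partitions can blow up with OOM.
    def splitInOnePass[A](rdd: RDD[A])(p: A => Boolean): RDD[(Seq[A], Seq[A])] =
      rdd.mapPartitions { it =>
        val (yes, no) = it.toVector.partition(p)
        Iterator((yes: Seq[A], no: Seq[A]))
      }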
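And the generalization to n parts I was talking about could then just be n filters over the cached RDD (again only a sketch, assuming a classify function that maps each element to a class index in 0 until n):

    import org.apache.spark.rdd.RDD

    // n passes over the cached data, one filter per class; each fragment is
    // a full RDD, so methods like stats() can be used on it directly.
    def splitN[A](rdd: RDD[A], n: Int)(classify: A => Int): Seq[RDD[A]] = {
      rdd.cache()
      (0 until n).map(i => rdd.filter(a => classify(a) == i))
    }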