NP man,
The thing is that since you're in a distributed env, it'd be cumbersome to do
that. Remember that Spark basically works on blocks/partitions; they are the
unit of distribution and parallelization.
That means that actions have to be run against them **after having been
scheduled on the cluster**.
Hi Andy,
Thanks again for your thoughts on this. I haven't found much information
about the internals of Spark, so I find these kinds of explanations of its
low-level mechanisms very useful and interesting. It's also nice to know
that the two-pass approach is a viable solution.
Regards,
Juan
yo,
First, here is the scala version:
http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)
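For reference, here's a quick illustration of what that `partition` method does on an ordinary Scala collection (the values and names are just for the example):

```scala
object SeqPartitionExample {
  def main(args: Array[String]): Unit = {
    // partition walks the sequence once and returns two collections:
    // the elements satisfying the predicate, and everything else.
    val (evens, odds) = Seq(1, 2, 3, 4, 5).partition(_ % 2 == 0)
    println(evens) // List(2, 4)
    println(odds)  // List(1, 3, 5)
  }
}
```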
Second: an RDD is distributed, so what you'll have to do is partition each
partition (:-D), or create two RDDs by filtering twice.
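A minimal sketch of the filter-twice idea, assuming a local Spark setup (the app name, master setting, and variable names here are illustrative, not from this thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterTwice {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("filter-twice").setMaster("local[*]"))
    val nums = sc.parallelize(1 to 10)
    // Two independent RDDs: each is materialized by its own pass
    // over the data when an action (collect) runs.
    val evens = nums.filter(_ % 2 == 0)
    val odds  = nums.filter(_ % 2 != 0)
    println(evens.collect().sorted.toList) // List(2, 4, 6, 8, 10)
    println(odds.collect().sorted.toList)  // List(1, 3, 5, 7, 9)
    sc.stop()
  }
}
```

Calling `nums.cache()` before the two filters would at least keep the second pass from re-reading the original source.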
Hi Andy, thanks for your response. I had already thought about filtering
twice; that was what I meant when I said it would be equivalent to applying
filter twice, but I was wondering if I could do it in a single pass, so
that it could later be generalized to an arbitrary number of classes. I would
also like
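On plain Scala collections, the single-pass generalization to an arbitrary number of classes is just `groupBy`, which computes each element's class label exactly once (the `% 3` class function is only an example):

```scala
object MultiWaySplit {
  def main(args: Array[String]): Unit = {
    // groupBy traverses the data once and buckets each element
    // under the class its label function returns.
    val byClass: Map[Int, Seq[Int]] = (1 to 10).groupBy(_ % 3)
    println(byClass(0)) // Vector(3, 6, 9)
    println(byClass(1)) // Vector(1, 4, 7, 10)
  }
}
```

For an RDD, one analogue (an assumption on my part, not something confirmed in this thread) would be to key each element by its class with `keyBy`, `cache()` the keyed RDD, and then derive one filtered RDD per class, so the class function runs in a single pass even though each output RDD still needs its own action.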