I don't think you can avoid examining each element of the RDD, if
that's what you mean. Your approach is basically the best you can do
in general. You're not making a second RDD here, and even if you did
this in two steps, the second RDD is really more bookkeeping than
a second huge data structure.
You can simplify your example a bit, although I doubt it's noticeably faster:
bigRdd.flatMap { i =>
  val h = md5(i)
  if (h(0) == 'A') {
    Some(h)
  } else {
    None
  }
}
This is also fine, simpler still, and if it's slower, not by much:
bigRdd.map(md5).filter(_(0) == 'A')
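The two forms can be checked against each other on a plain Scala collection, since the RDD API mirrors these collection combinators. A minimal, self-contained sketch, assuming a hypothetical md5 helper that uppercase-hex-encodes the digest (the thread leaves md5 undefined):

```scala
import java.security.MessageDigest

object Demo {
  // Hypothetical md5 helper (the thread assumes one exists): uppercase
  // hex encoding of the MD5 digest of the input's decimal string form.
  def md5(i: Int): String =
    MessageDigest.getInstance("MD5")
      .digest(i.toString.getBytes("UTF-8"))
      .map("%02X".format(_))
      .mkString

  // flatMap with Option: emit the hash only when it starts with 'A'
  def viaFlatMap(data: Seq[Int]): Seq[String] =
    data.flatMap { i =>
      val h = md5(i)
      if (h(0) == 'A') Some(h) else None
    }

  // map then filter: simpler still, and equivalent
  def viaMapFilter(data: Seq[Int]): Seq[String] =
    data.map(md5).filter(_(0) == 'A')

  def main(args: Array[String]): Unit = {
    val data = 1 to 1000 // local stand-in for bigRdd
    assert(viaFlatMap(data) == viaMapFilter(data))
    println(s"kept ${viaMapFilter(data).size} of ${data.size}")
  }
}
```

On an actual RDD the same two definitions apply verbatim, with bigRdd in place of data.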
On Thu, Dec 18, 2014 at 10:18 PM, bethesda swearinge...@mac.com wrote:
We have a very large RDD, and I need to create a new RDD whose values are
derived from each record of the original RDD, retaining only the few new
records that meet a criterion. I want to avoid creating a second large RDD
and then filtering it, since I believe this could tax system resources
unnecessarily (tell me if that assumption is wrong).
So for example, /and this is just an example/, say we have an RDD containing
1 to 1,000,000; we iterate through each value, compute its md5 hash, and
keep only the results that start with 'A'.
What we've tried, which seems to work but seemed a bit ugly and perhaps
not efficient, was the following in pseudocode. Is this the best way to do
this?
Thanks
bigRdd.flatMap { i =>
  val h = md5(i)
  if (h.substring(0, 1) == "A") {
    Array(h)
  } else {
    Array[String]()
  }
}
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Creating-a-smaller-derivative-RDD-from-an-RDD-tp20769.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org