We have a very large RDD, and I need to create a new RDD whose values are derived from each record of the original, retaining only the few new records that meet a criterion. I want to avoid creating a second large RDD and then filtering it, since I believe this could tax system resources unnecessarily (tell me if that assumption is wrong).
So for example (and this is just an example), say we have an RDD containing 1 to 1,000,000; we iterate through each value, compute its MD5 hash, and keep only the results that start with 'A'. What we've tried, which seems to work but looks a bit ugly and is perhaps not efficient, is the following in pseudocode. *Is this the best way to do this?*

    bigRdd.flatMap { i =>
      val h = md5(i)
      // substring(0, 1) is the first character; substring(1, 1) would be empty
      if (h.substring(0, 1) == "A") Array(h) else Array[String]()
    }

Thanks
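To make the comparison concrete, here is a minimal sketch of the two variants in question: a plain map followed by filter, and a single flatMap. As far as I understand, both map and filter are lazy, narrow transformations that Spark pipelines record by record within a partition, so the intermediate RDD of hashes should not be materialized unless it is cached. The md5 helper, the object name, the app name, and the local master setting are all assumptions made for illustration and are not part of the original code.

```scala
import java.security.MessageDigest
import org.apache.spark.{SparkConf, SparkContext}

object SmallDerivedRdd {
  // Hypothetical helper (an assumption): upper-case hex MD5 of the
  // decimal string form of i.
  def md5(i: Int): String =
    MessageDigest.getInstance("MD5")
      .digest(i.toString.getBytes("UTF-8"))
      .map(b => f"${b & 0xff}%02X")
      .mkString

  def main(args: Array[String]): Unit = {
    // Local master is assumed here only so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("small-derived-rdd").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val bigRdd = sc.parallelize(1 to 1000000)

      // Variant 1: map then filter. Nothing runs until an action is called,
      // and each record is hashed and tested in the same pass.
      val viaFilter = bigRdd.map(md5).filter(_.startsWith("A"))

      // Variant 2: one flatMap that emits zero or one element per record.
      val viaFlatMap = bigRdd.flatMap { i =>
        val h = md5(i)
        if (h.startsWith("A")) Iterator(h) else Iterator.empty
      }

      // count() is the action that triggers each computation.
      println(s"map+filter kept ${viaFilter.count()} records")
      println(s"flatMap kept ${viaFlatMap.count()} records")
    } finally {
      sc.stop()
    }
  }
}
```

If this reading is right, the two variants should keep the same records, and the choice between them is mostly one of readability.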