We have a very large RDD and I need to create a new RDD whose values are
derived from each record of the original RDD, and we only retain the few new
records that meet a criteria.  I want to avoid creating a second large RDD
and then filtering it since I believe this could tax system resources
unnecessarily (tell me if that assumption is wrong.)

So for example, /and this is just an example/, say we have an RDD with 1 to
1,000,000 and we iterate through each value, and compute it's md5 hash, and
we only keep the results that start with 'A'.

What we've tried and seems to work but which seemed a bit ugly, and perhaps
not efficient, was the following in pseudocode. * Is this the best way to do
this?*

Thanks

bigRdd.flatMap( { i =>
  val h = md5(i)
  if (h.substring(1,1) == 'A') {
    Array(h)
  } else {
    Array[String]()
  }
})



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Creating-a-smaller-derivative-RDD-from-an-RDD-tp20769.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Reply via email to