I don't think you can avoid examining each element of the RDD, if
that's what you mean. Your approach is basically the best you can do
in general. You're not making a second RDD here, and even if you did
this in two steps, the second RDD is really more of a bookkeeping
structure than a second huge data set: transformations are lazy, so
the intermediate values are never all materialized at once.

You can simplify your example a bit, although I doubt it's noticeably faster:

bigRdd.flatMap { i =>
  val h = md5(i)
  if (h(0) == 'A') {
    Some(h)  // keep this hash
  } else {
    None     // dropped when flatMap flattens
  }
}
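
(Returning an Option works here because Scala's implicit
option2Iterable conversion lets flatMap flatten it: the Somes survive
and the Nones disappear.)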

This is also fine, simpler still, and if it's slower, not by much:

bigRdd.map(md5).filter(_(0) == 'A')
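
If you like partial functions, RDD.collect also has an overload that
takes a PartialFunction and does the map-and-filter in one step (not
the zero-arg collect that pulls results back to the driver):

bigRdd.map(md5).collect { case h if h(0) == 'A' => h }

For reference, here is a minimal sketch of the md5 helper these
examples assume (hypothetical, since you didn't show yours). It
hex-encodes the digest in uppercase, so the first character can
actually be 'A':

import java.security.MessageDigest

def md5(i: Int): String = {
  // MessageDigest isn't thread-safe, so create one per call here
  val d = MessageDigest.getInstance("MD5")
  d.digest(i.toString.getBytes("UTF-8"))
    .map("%02X".format(_))  // two uppercase hex chars per byte
    .mkString
}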


On Thu, Dec 18, 2014 at 10:18 PM, bethesda <swearinge...@mac.com> wrote:
> We have a very large RDD, and I need to create a new RDD whose values are
> derived from each record of the original RDD, retaining only the few new
> records that meet a criterion.  I want to avoid creating a second large RDD
> and then filtering it, since I believe this could tax system resources
> unnecessarily (tell me if that assumption is wrong).
>
> So for example, /and this is just an example/, say we have an RDD with 1 to
> 1,000,000, and we iterate through each value, compute its md5 hash, and
> we only keep the results that start with 'A'.
>
> What we've tried, and it seems to work but seemed a bit ugly and perhaps
> not efficient, was the following in pseudocode. *Is this the best way to
> do this?*
>
> Thanks
>
> bigRdd.flatMap { i =>
>   val h = md5(i)
>   if (h.substring(0, 1) == "A") {
>     Array(h)
>   } else {
>     Array[String]()
>   }
> }
