If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out.

Burak

On Feb 19, 2015 7:37 AM, "Ilya Ganelin" <ilgan...@gmail.com> wrote:

> Hi Shahab - if your data structures are small enough, a broadcasted Map is
> going to provide faster lookup. Lookup within an RDD is an O(m) operation,
> where m is the size of the partition. For RDDs with multiple partitions,
> executors can operate on them in parallel, so you get some improvement for
> larger RDDs.
>
> On Thu, Feb 19, 2015 at 7:31 AM shahab <shahab.mok...@gmail.com> wrote:
>
>> Hi,
>>
>> I am doing lookups on cached RDDs [(Int,String)], and I noticed that the
>> lookup is relatively slow (30-100 ms). I even tried this on one machine
>> with a single partition, but it made no difference!
>>
>> The RDDs are not large at all, 3-30 MB.
>>
>> Is this expected behaviour? Should I use another data structure, like a
>> HashMap, to keep the data and look it up there, and use Broadcast to send
>> a copy to all machines?
>>
>> best,
>> /Shahab
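
To make the O(m) point above concrete, here is a minimal plain-Python sketch. It does not use Spark at all: a partition is modeled as a list of (key, value) pairs, and scan_lookup, build_map, and the sample data are hypothetical names made up for illustration. It only contrasts a linear scan with a hash-map lookup; in a real job the map would typically be built once on the driver and shipped to the executors as a broadcast variable.

```python
from typing import Dict, List, Tuple

# Assumption: a "partition" is modeled as a plain list of (key, value) pairs.
Pair = Tuple[int, str]


def scan_lookup(partition: List[Pair], key: int) -> List[str]:
    """Linear scan, O(m) in the partition size: return every value stored under key."""
    return [value for k, value in partition if k == key]


def build_map(partition: List[Pair]) -> Dict[int, str]:
    """One-time O(m) build; later lookups are O(1) on average.

    If a key occurs more than once, only its last value is kept.
    """
    return dict(partition)


# Hypothetical sample data standing in for a small cached dataset.
partition = [(i, f"value-{i}") for i in range(100_000)]
local_map = build_map(partition)

print(scan_lookup(partition, 99_999))  # prints ['value-99999']
print(local_map.get(99_999))           # prints value-99999
```

The scan touches every pair on each call, while the map pays that cost once at build time. That trade-off only makes sense when the data fits comfortably in memory on every machine, which seems plausible for the 3-30 MB mentioned above.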