Thank you all. Just changing the RDD to a Map structure saved me approx. 1 second.
Yes, I will check out IndexedRDD to see if it has better performance.

best,
/Shahab

On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz <brk...@gmail.com> wrote:

> If your dataset is large, there is a Spark Package called IndexedRDD
> optimized for lookups. Feel free to check that out.
>
> Burak
> On Feb 19, 2015 7:37 AM, "Ilya Ganelin" <ilgan...@gmail.com> wrote:
>
>> Hi Shahab - if your data structures are small enough, a broadcasted Map is
>> going to provide faster lookup. Lookup within an RDD is an O(m) operation
>> where m is the size of the partition. For RDDs with multiple partitions,
>> executors can operate on it in parallel so you get some improvement for
>> larger RDDs.
>> On Thu, Feb 19, 2015 at 7:31 AM shahab <shahab.mok...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am doing lookups on cached RDD[(Int, String)]s, and I noticed that the
>>> lookup is relatively slow (30-100 ms). I even tried this on one machine
>>> with a single partition, but it made no difference!
>>>
>>> The RDDs are not large at all, 3-30 MB.
>>>
>>> Is this expected behaviour? Should I use another data structure, like a
>>> HashMap, to keep the data and look it up there, and use Broadcast to send
>>> a copy to all machines?
>>>
>>> best,
>>> /Shahab
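
For anyone finding this thread later, here is a rough sketch of the broadcast Map approach suggested above, next to the plain RDD lookup. It is not the code I actually ran. The object name, the sample data and the local master setting are made up for illustration, and it assumes the data is small enough to fit in memory on the driver and on every executor.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3 the pair-RDD methods also need:
// import org.apache.spark.SparkContext._

// Hypothetical example object; names and data are illustrative only.
object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("broadcast-lookup-sketch")
      .setMaster("local[*]") // assumption: local test run
    val sc = new SparkContext(conf)
    try {
      val pairs: RDD[(Int, String)] =
        sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c")).cache()

      // Option 1: lookup on the RDD. Each call runs as a Spark job and
      // returns every value stored under the key.
      val viaRdd: Seq[String] = pairs.lookup(2)
      println(s"RDD lookup: $viaRdd")

      // Option 2: collect the small RDD to the driver as a Map once,
      // then broadcast a read-only copy to the executors.
      val localMap: scala.collection.Map[Int, String] = pairs.collectAsMap()
      val lookupTable = sc.broadcast(localMap)

      // On the driver this is an ordinary in-memory map lookup.
      println(s"Driver-side map lookup: ${localMap.get(2)}")

      // Inside tasks, read the broadcast value instead of the RDD.
      val keys = sc.parallelize(Seq(1, 3, 4))
      val resolved = keys.map(k => k -> lookupTable.value.get(k)).collect()
      resolved.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that collectAsMap keeps only one value per key, so this only matches lookup when the keys are unique.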