Thank you all. Just changing the RDD to a Map structure saved me approx. 1
second.
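
A minimal sketch of this kind of change, not the exact code from my job. It
assumes a local master, made-up sample data, and hypothetical names
(BroadcastLookupSketch, resolve, tableBc). It contrasts RDD.lookup with a Map
that is collected once and broadcast:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  // Hypothetical helper: plain in-memory hash lookup with a fallback value.
  def resolve(table: Map[Int, String], key: Int): String =
    table.getOrElse(key, "<missing>")

  def main(args: Array[String]): Unit = {
    // Assumption: local mode and tiny sample data, for illustration only.
    val conf = new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq(1 -> "one", 2 -> "two", 3 -> "three")).cache()

    // RDD lookup: every call runs a Spark job, so scheduling overhead is paid
    // per key even when the data is small and cached.
    val viaRdd: Seq[String] = pairs.lookup(2)

    // Collect the small data set to the driver once, then broadcast it so
    // each executor holds a read-only copy.
    val table: Map[Int, String] = pairs.collectAsMap().toMap
    val tableBc = sc.broadcast(table)

    // Driver-side lookup: no Spark job involved.
    val viaMap: String = resolve(tableBc.value, 2)

    // Executor-side lookup inside a transformation.
    val keys = sc.parallelize(Seq(1, 3, 4))
    val resolved = keys.map(k => k -> resolve(tableBc.value, k)).collect()

    println(s"lookup: $viaRdd, map: $viaMap, resolved: ${resolved.mkString(", ")}")
    sc.stop()
  }
}
```

This only makes sense while the data fits comfortably in memory on the driver
and on every executor, which should hold for RDDs of 3-30 MB.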

Yes, I will check out IndexedRDD to see if it has better performance.

best,
/Shahab

On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz <brk...@gmail.com> wrote:

> If your dataset is large, there is a Spark Package called IndexedRDD
> optimized for lookups. Feel free to check that out.
>
> Burak
> On Feb 19, 2015 7:37 AM, "Ilya Ganelin" <ilgan...@gmail.com> wrote:
>
>> Hi Shahab - if your data structures are small enough, a broadcast Map is
>> going to provide faster lookup. Lookup within an RDD is an O(m) operation,
>> where m is the size of the partition. For RDDs with multiple partitions,
>> executors can operate on them in parallel, so you get some improvement for
>> larger RDDs.
>> On Thu, Feb 19, 2015 at 7:31 AM shahab <shahab.mok...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am doing lookups on cached RDDs of type RDD[(Int, String)], and I noticed
>>> that each lookup is relatively slow (30-100 ms). I even tried this on one
>>> machine with a single partition, but it made no difference!
>>>
>>> The RDDs are not large at all, 3-30 MB.
>>>
>>> Is this expected behaviour? Should I use another data structure, such as a
>>> HashMap, to keep the data and look it up there, and use Broadcast to send a
>>> copy to all machines?
>>>
>>> best,
>>> /Shahab
