If your dataset is large, there is a Spark Package called IndexedRDD
optimized for lookups. Feel free to check that out.
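
To make the trade-off discussed in the quoted replies below concrete, here is
a rough plain-Python model of the two strategies. It is not Spark code: the
names `partition`, `scan_lookup`, and `table` are made up for illustration,
and it only mimics the shape of the problem (a lookup on an RDD returns all
values for a key by scanning pairs, while a hash map answers in roughly
constant time once it has been built).

```python
# Plain-Python model (no Spark involved) of the two lookup strategies.
# `partition`, `scan_lookup`, and `table` are hypothetical names.

def scan_lookup(partition, key):
    """Linear scan over (key, value) pairs: O(m) in the partition size m."""
    return [v for k, v in partition if k == key]

# One "partition" of (Int, String)-style pairs.
partition = [(i, "value-%d" % i) for i in range(100_000)]

# Hash-map alternative: build once, then each lookup is O(1) on average.
# In Spark this is the structure one would broadcast to the executors.
table = dict(partition)

scanned = scan_lookup(partition, 99_999)  # walks the whole list
hashed = table.get(99_999)                # single hash probe
```

The scan pays the full cost on every call, whereas the map pays it once at
build time, which is why the map only makes sense when the data is small
enough to copy to every machine.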

Burak
On Feb 19, 2015 7:37 AM, "Ilya Ganelin" <ilgan...@gmail.com> wrote:

> Hi Shahab - if your data structures are small enough, a broadcast Map is
> going to provide faster lookups. A lookup within an RDD is an O(m) operation,
> where m is the size of the partition. For RDDs with multiple partitions,
> executors can scan the partitions in parallel, so you get some improvement
> for larger RDDs.
> On Thu, Feb 19, 2015 at 7:31 AM shahab <shahab.mok...@gmail.com> wrote:
>
>> Hi,
>>
>> I am doing lookups on cached RDDs of (Int, String) pairs, and I noticed
>> that each lookup is relatively slow (30-100 ms). I even tried this on one
>> machine with a single partition, but it made no difference!
>>
>> The RDDs are not large at all, 3-30 MB.
>>
>> Is this expected behaviour? Should I use another data structure, such as a
>> HashMap, to hold the data and look it up there, and use Broadcast to send a
>> copy to all machines?
>>
>> best,
>> /Shahab
