PairRDDFunctions.lookup is good enough in Spark; it's just that its time
complexity is O(N). Without a partitioner, N is the size of the whole RDD;
for RDDs equipped with a partitioner, lookup only scans the one partition
the key maps to, so N is the average size of a partition.
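
For example (a rough sketch in the shell, with made-up data and partition
count): hash-partitioning and caching the pair RDD up front means lookup
only has to scan the single partition the key hashes to.

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._  // implicit PairRDDFunctions

  // hash the pair RDD into 8 partitions and keep it in memory
  val pairs = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))
    .partitionBy(new HashPartitioner(8))
    .cache()

  pairs.lookup("a")  // Seq(1.0); scans only the partition "a" hashes to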


On Sat, Jan 25, 2014 at 5:16 AM, Andrew Ash <and...@andrewash.com> wrote:

> If you have a pair RDD (an RDD[(A, B)]), then you can use the .lookup()
> method on it for faster access.
>
>
> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions
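>
> For instance (just a sketch; "pairs" here stands for any RDD[(String, Double)]):
>
>   pairs.lookup("someKey")  // returns a Seq[Double] of the values for that key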
>
> Spark's strength is running computations across a large set of data.  If
> you're trying to do fast lookup of a few individual keys, I'd recommend
> something more like memcached or Elasticsearch.
>
>
> On Fri, Jan 24, 2014 at 1:11 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>
>> Yes, that works.
>>
>> But then the hashmap's fast key lookup is gone, and the search will be a
>> linear scan with an iterator. Not sure if Spark internally adds
>> optimizations for a Seq, but otherwise one has to assume this becomes a
>> List/Array without the fast key lookup of a hashmap or b-tree.
>>
>> Any thoughts?
>>
>> On Fri, Jan 24, 2014 at 1:00 PM, Frank Austin Nothaft <fnoth...@berkeley.edu> wrote:
>>
>>> Manoj,
>>>
>>> I assume you’re trying to create an RDD[(String, Double)]? Couldn’t you
>>> just do:
>>>
>>> val cr_rdd = sc.parallelize(cr.toSeq)
>>>
>>> The toSeq would convert the HashMap[String,Double] into a Seq[(String,
>>> Double)] before calling the parallelize function.
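>>>
>>> For example, end to end (a quick sketch with made-up data):
>>>
>>>   import scala.collection.mutable.HashMap
>>>
>>>   val cr = HashMap("a" -> 1.0, "b" -> 2.0)
>>>   val cr_rdd = sc.parallelize(cr.toSeq)  // RDD[(String, Double)]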
>>>
>>> Regards,
>>>
>>> Frank Austin Nothaft
>>> fnoth...@berkeley.edu
>>> fnoth...@eecs.berkeley.edu
>>> 202-340-0466
>>>
>>> On Jan 24, 2014, at 12:56 PM, Manoj Samel <manojsamelt...@gmail.com>
>>> wrote:
>>>
>>> > Is there a way to create an RDD over a hashmap?
>>> >
>>> > If I have a hash map and try sc.parallelize, it gives
>>> >
>>> > <console>:17: error: type mismatch;
>>> >  found   : scala.collection.mutable.HashMap[String,Double]
>>> >  required: Seq[?]
>>> > Error occurred in an application involving default arguments.
>>> >        val cr_rdd = sc.parallelize(cr)
>>> >                                    ^
>>>
>>>
>>
>
