PairRDDFunctions.lookup is good enough in Spark; it's just that its time complexity is O(N). For RDDs equipped with a partitioner, lookup only scans the single partition that the key hashes to, so N is the average size of a partition rather than the size of the whole RDD.
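To make the cost model concrete, here is a minimal plain-Scala sketch (not Spark source code, just an illustration) of what lookup does on a partitioned pair RDD: hash the key to one partition, then scan that partition linearly. All names here (`partitionOf`, `lookup`, the sample data) are invented for the illustration.

```scala
// Toy model of a hash-partitioned pair RDD: a map from partition id
// to the sequence of (key, value) pairs stored in that partition.
val numPartitions = 4
val data = Seq("a" -> 1.0, "b" -> 2.0, "c" -> 3.0)

// Mimics HashPartitioner: non-negative hash modulo the partition count.
def partitionOf(key: String): Int =
  (key.hashCode % numPartitions + numPartitions) % numPartitions

val partitions: Map[Int, Seq[(String, Double)]] =
  data.groupBy { case (k, _) => partitionOf(k) }

// lookup touches only one partition, then scans it linearly:
// O(partition size), not O(total RDD size).
def lookup(key: String): Seq[Double] =
  partitions
    .getOrElse(partitionOf(key), Seq.empty)
    .collect { case (k, v) if k == key => v }
```

Without a partitioner, Spark has no way to know which partition holds the key, so every partition must be scanned.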
On Sat, Jan 25, 2014 at 5:16 AM, Andrew Ash <and...@andrewash.com> wrote:

> If you have a pair RDD (an RDD[A,B]) then you can use the .lookup() method
> on it for faster access.
>
> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions
>
> Spark's strength is running computations across a large set of data. If
> you're trying to do fast lookup of a few individual keys, I'd recommend
> something more like memcached or Elasticsearch.
>
> On Fri, Jan 24, 2014 at 1:11 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>
>> Yes, that works.
>>
>> But then the hashmap functionality of fast key lookup is gone, and the
>> search will be linear using an iterator. Not sure if Spark internally
>> creates additional optimizations for Seq, but otherwise one has to assume
>> this becomes a List/Array without the fast key lookup of a hashmap or
>> b-tree.
>>
>> Any thoughts?
>>
>> On Fri, Jan 24, 2014 at 1:00 PM, Frank Austin Nothaft <fnoth...@berkeley.edu> wrote:
>>
>>> Manoj,
>>>
>>> I assume you're trying to create an RDD[(String, Double)]? Couldn't you
>>> just do:
>>>
>>> val cr_rdd = sc.parallelize(cr.toSeq)
>>>
>>> The toSeq would convert the HashMap[String,Double] into a Seq[(String,
>>> Double)] before calling the parallelize function.
>>>
>>> Regards,
>>>
>>> Frank Austin Nothaft
>>> fnoth...@berkeley.edu
>>> fnoth...@eecs.berkeley.edu
>>> 202-340-0466
>>>
>>> On Jan 24, 2014, at 12:56 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>
>>> > Is there a way to create an RDD over a hashmap?
>>> >
>>> > If I have a hash map and try sc.parallelize, it gives
>>> >
>>> > <console>:17: error: type mismatch;
>>> >  found   : scala.collection.mutable.HashMap[String,Double]
>>> >  required: Seq[?]
>>> > Error occurred in an application involving default arguments.
>>> >        val cr_rdd = sc.parallelize(cr)
>>> >                                    ^
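The toSeq conversion Frank suggests can be sketched in plain Scala without a SparkContext; the variable names below mirror the ones in the thread, and the parallelize call is shown only in a comment since it needs a running SparkContext.

```scala
import scala.collection.mutable

// A mutable HashMap like the `cr` in the original question.
val cr = mutable.HashMap("x" -> 1.5, "y" -> 2.5)

// sc.parallelize expects a Seq, not a Map; toSeq produces the
// Seq[(String, Double)] of key/value pairs that parallelize accepts.
val pairs: Seq[(String, Double)] = cr.toSeq

// With a SparkContext `sc` in scope, this yields an RDD[(String, Double)],
// on which PairRDDFunctions such as .lookup() become available:
// val cr_rdd = sc.parallelize(pairs)
```

Note that the hashmap's O(1) access is lost in the conversion, which is what the rest of the thread is about: the RDD stores plain pairs, and only a partitioner narrows where a key can live.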