Thanks to all the suggestions, I am able to make progress on it.

Manoj
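A minimal end-to-end sketch of the approach the thread below converges on, combining Frank's toSeq suggestion with a partitioner so that lookup() scans only one partition. The sample data and the HashPartitioner size are illustrative, and an existing SparkContext `sc` is assumed:

    import scala.collection.mutable.HashMap
    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair RDD implicits in older Spark

    val cr = HashMap("a" -> 1.0, "b" -> 2.0)  // stand-in for the real map

    // toSeq turns the HashMap into a Seq[(String, Double)], which
    // sc.parallelize accepts; partitionBy attaches a partitioner so
    // lookup() only scans the one partition that can hold the key.
    val cr_rdd = sc.parallelize(cr.toSeq).partitionBy(new HashPartitioner(4))

    cr_rdd.lookup("a")  // Seq(1.0)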
On Fri, Jan 24, 2014 at 1:54 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> On this note, you can do something smarter than the basic lookup function.
> You could convert each partition of the key-value pair RDD into a hashmap
> using something like
>
>     val rddOfHashmaps = pairRDD.mapPartitions(iterator => {
>       val hashmap = new HashMap[String, ArrayBuffer[Double]]
>       iterator.foreach { case (key, value) =>
>         hashmap.getOrElseUpdate(key, new ArrayBuffer[Double]) += value
>       }
>       Iterator(hashmap)
>     }, preservesPartitioning = true)
>
> And then you can do a variation of the lookup function
> (https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L549)
> to look up the right partition, and then within that partition directly
> look up the hashmap and return the value (rather than scanning the whole
> partition). That gives practically O(1) lookup time instead of O(N). But I
> doubt it will match something that a dedicated lookup system like memcached
> would achieve.
>
> TD
>
> On Fri, Jan 24, 2014 at 1:36 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> By my reading of the code, it uses the partitioner to decide which worker
>> the key lands on, then does an O(N) scan of that partition. I think we're
>> saying the same thing.
>>
>> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L549
>>
>> On Fri, Jan 24, 2014 at 1:26 PM, Cheng Lian <rhythm.m...@gmail.com> wrote:
>>
>>> PairRDDFunctions.lookup is good enough in Spark; it's just that its time
>>> complexity is O(N). Of course, for RDDs equipped with a partitioner, N is
>>> the average size of a partition.
>>>
>>> On Sat, Jan 25, 2014 at 5:16 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>
>>>> If you have a pair RDD (an RDD[(A, B)]), then you can use the .lookup()
>>>> method on it for faster access.
>>>>
>>>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions
>>>>
>>>> Spark's strength is running computations across a large set of data.
>>>> If you're trying to do fast lookups of a few individual keys, I'd
>>>> recommend something more like memcached or Elasticsearch.
>>>>
>>>> On Fri, Jan 24, 2014 at 1:11 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>
>>>>> Yes, that works.
>>>>>
>>>>> But then the hashmap functionality of the fast key lookup etc. is
>>>>> gone, and the search will be linear, using an iterator etc. Not sure if
>>>>> Spark internally creates additional optimizations for Seq, but otherwise
>>>>> one has to assume this becomes a List/Array without the fast key lookup
>>>>> of a hashmap or b-tree.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> On Fri, Jan 24, 2014 at 1:00 PM, Frank Austin Nothaft
>>>>> <fnoth...@berkeley.edu> wrote:
>>>>>
>>>>>> Manoj,
>>>>>>
>>>>>> I assume you're trying to create an RDD[(String, Double)]? Couldn't
>>>>>> you just do:
>>>>>>
>>>>>>     val cr_rdd = sc.parallelize(cr.toSeq)
>>>>>>
>>>>>> The toSeq would convert the HashMap[String, Double] into a
>>>>>> Seq[(String, Double)] before calling the parallelize function.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Frank Austin Nothaft
>>>>>> fnoth...@berkeley.edu
>>>>>> fnoth...@eecs.berkeley.edu
>>>>>> 202-340-0466
>>>>>>
>>>>>> On Jan 24, 2014, at 12:56 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>>>
>>>>>> > Is there a way to create an RDD over a hashmap?
>>>>>> >
>>>>>> > If I have a hash map and try sc.parallelize, it gives
>>>>>> >
>>>>>> > <console>:17: error: type mismatch;
>>>>>> >  found   : scala.collection.mutable.HashMap[String,Double]
>>>>>> >  required: Seq[?]
>>>>>> > Error occurred in an application involving default arguments.
>>>>>> >        val cr_rdd = sc.parallelize(cr)
>>>>>> >                                    ^
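
To flesh out TD's suggestion above, here is a minimal sketch of a partition-targeted lookup over the per-partition hashmaps. `lookupFast` is a hypothetical helper, not part of the Spark API; it assumes the RDD was built with preservesPartitioning = true so the original pair RDD's partitioner is still attached, and it mirrors the runJob call that PairRDDFunctions.lookup makes (per the Spark version linked above), but runs it against only the one partition that can contain the key:

    import scala.collection.mutable.{ArrayBuffer, HashMap}
    import org.apache.spark.rdd.RDD

    // Hypothetical helper, not part of Spark. Each partition of `rdd`
    // holds exactly one hashmap (see TD's mapPartitions snippet), so the
    // in-partition lookup is a constant-time hashmap get.
    def lookupFast(rdd: RDD[HashMap[String, ArrayBuffer[Double]]],
                   key: String): Seq[Double] = {
      val partitioner = rdd.partitioner.getOrElse(
        sys.error("RDD must retain the original pair RDD's partitioner"))
      // Only this partition can contain the key.
      val index = partitioner.getPartition(key)
      // Run a job on just that single partition, like lookup() does.
      val results = rdd.context.runJob(
        rdd,
        (it: Iterator[HashMap[String, ArrayBuffer[Double]]]) =>
          if (it.hasNext) it.next().getOrElse(key, ArrayBuffer.empty[Double]).toSeq
          else Seq.empty[Double],
        Seq(index),
        allowLocal = false)
      results(0)
    }

Usage would be something like lookupFast(rddOfHashmaps, "someKey"), returning all values recorded for that key, or an empty Seq if the key is absent.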