Is just doing a join not an option? If you carefully manage your partitioning, this can be pretty efficient (meaning no extra shuffle; basically a map-side join).
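A minimal sketch of what I mean, borrowing the names from your snippet below (InvIndex, generateInvertedIndex, calculateSimilarity, and the RDDs `vectors` and `a` are yours; numPartitions is a placeholder you'd tune):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.SparseVector
    import org.apache.spark.storage.StorageLevel

    val numPartitions = 32  // placeholder: size to your data/cluster
    val part = new HashPartitioner(numPartitions)

    // Co-partition both RDDs with the *same* partitioner, so equal keys
    // land in corresponding partitions.
    val partitionedVectors: RDD[(Int, SparseVector)] =
      vectors.partitionBy(part).persist(StorageLevel.MEMORY_AND_DISK)

    // reduceByKey accepts a partitioner, so the inverted indexes come out
    // already partitioned the way we want (one large InvIndex per key).
    val invertedIndexes: RDD[(Int, InvIndex)] =
      a.reduceByKey(part, generateInvertedIndex)
        .persist(StorageLevel.MEMORY_AND_DISK)

    // Both sides share the partitioner, so this join is a narrow
    // dependency: no shuffle, and each vector meets only the InvIndex
    // that has the same key.
    val similarities =
      partitionedVectors.join(invertedIndexes).map {
        case (_, (vec, invIndex)) => invIndex.calculateSimilarity(vec)
      }

Persisting both RDDs keeps the co-partitioned layout around, so repeated jobs over them don't re-shuffle. The large InvIndex objects only ever move during the initial reduceByKey, never again for the join itself.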
On Jan 13, 2016 2:30 PM, "Daniel Imberman" <daniel.imber...@gmail.com> wrote:

> I'm looking for a way to send structures to pre-determined partitions so
> that they can be used by another RDD in a mapPartitions.
>
> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
> indexes. The inverted index objects are quite large.
>
> My hope is to do a mapPartitions within the RDD of vectors where I can
> compare each vector to the inverted index. The issue is that I only NEED
> one inverted index object per partition (which would have the same key as
> the values within that partition).
>
> val vectors: RDD[(Int, SparseVector)]
>
> val invertedIndexes: RDD[(Int, InvIndex)] =
>   a.reduceByKey(generateInvertedIndex)
>
> vectors.mapPartitions { iter =>
>   val invIndex = invertedIndexes(samePartitionKey)
>   iter.map(invIndex.calculateSimilarity(_))
> }
>
> How could I go about setting up the partitioning such that the specific
> data structure I need will be present for the mapPartitions, but I won't
> have the extra overhead of sending over all values (which would happen if
> I were to make a broadcast variable)?
>
> One thought I have been having is to store the objects in HDFS, but I'm
> not sure if that would be a suboptimal solution (it seems like it could
> slow down the process a lot).
>
> Another thought I am currently exploring is whether there is some way I
> can create a custom Partition or Partitioner that could hold the data
> structure (although that might get too complicated and become
> problematic).
>
> Any thoughts on how I could attack this issue would be highly appreciated.
>
> Thank you for your help!