I'm looking for a way to send data structures to pre-determined partitions so that they can be used against another RDD inside a mapPartitions call.
Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large. My hope is to do a mapPartitions over the RDD of vectors, comparing each vector to an inverted index. The catch is that each partition only NEEDS the one inverted index object whose key matches the values in that partition:

    val vectors: RDD[(Int, SparseVector)] = ...
    val invertedIndexes: RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)

    // Pseudocode -- I know one RDD can't be looked up inside another RDD's
    // closure; this just illustrates the per-partition lookup I want.
    vectors.mapPartitions { iter =>
      val invIndex = invertedIndexes(samePartitionKey)
      iter.map(invIndex.calculateSimilarity(_))
    }

How could I set up the partitioning so that the specific data structure each partition needs is present for the mapPartitions, without the overhead of shipping every inverted index to every executor (which is what would happen if I made them a broadcast variable)?

One thought I've had is to store the objects in HDFS, but I'm not sure whether that would be a suboptimal solution (it seems like it could slow the process down a lot).

Another thought I'm currently exploring is whether there is some way to create a custom Partition or Partitioner that could hold the data structure (although that might get too complicated and become problematic).

Any thoughts on how I could attack this issue would be highly appreciated. Thank you for your help!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
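[Editor's note] One direction consistent with the custom-Partitioner idea in the question: partition both RDDs with the same Partitioner, so the inverted index for key k lands in the same partition as the vectors for key k, then walk matching partitions together with zipPartitions (which requires both RDDs to have the same partition count). A rough, untested sketch -- InvIndex, generateInvertedIndex, and calculateSimilarity are placeholders from the question, the Double return type and non-negative Int keys are assumptions, and it assumes numPartitions is chosen so that at most one key maps to each partition:

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.SparseVector

    // Routes each Int key to a fixed partition. With numPartitions >= the
    // number of distinct keys (and keys chosen to avoid collisions), each
    // partition holds the records for exactly one key.
    class KeyPartitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key.asInstanceOf[Int] % numPartitions
    }

    def similarities(vectors: RDD[(Int, SparseVector)],
                     invertedIndexes: RDD[(Int, InvIndex)],
                     numParts: Int): RDD[Double] = {
      val part  = new KeyPartitioner(numParts)
      val vecsP = vectors.partitionBy(part)
      val idxP  = invertedIndexes.partitionBy(part)

      // zipPartitions pairs partition i of vecsP with partition i of idxP,
      // so each task sees only the one co-located inverted index.
      vecsP.zipPartitions(idxP) { (vecIter, idxIter) =>
        idxIter.toList match {
          case (_, invIndex) :: Nil =>
            vecIter.map { case (_, v) => invIndex.calculateSimilarity(v) }
          case _ => Iterator.empty // no index was built for this partition's key
        }
      }
    }

If the one-key-per-partition assumption doesn't hold, a plain vecsP.join(idxP) after the two partitionBy calls should also work: because both sides share the partitioner, the join is narrow (no extra shuffle), though the index reference is then paired with every vector record.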
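[Editor's note] The HDFS idea from the question can also be sketched: serialize each inverted index to a path derived from its partition id, then have each task open only its own file inside mapPartitionsWithIndex. This trades the broadcast cost for one HDFS read per task. The path scheme, Java serialization, and a Serializable InvIndex are all assumptions:

    import java.io.ObjectInputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.SparseVector

    def similaritiesViaHdfs(vectors: RDD[(Int, SparseVector)],
                            numParts: Int): RDD[Double] = {
      // Assumes vectors were partitioned so partition pid holds the key
      // whose index was written to /indexes/part-<pid> (hypothetical layout).
      vectors.mapPartitionsWithIndex { (pid, iter) =>
        val fs = FileSystem.get(new Configuration())
        val in = new ObjectInputStream(fs.open(new Path(s"/indexes/part-$pid")))
        val invIndex =
          try in.readObject().asInstanceOf[InvIndex]
          finally in.close()
        iter.map { case (_, v) => invIndex.calculateSimilarity(v) }
      }
    }

Whether this beats co-partitioning depends on index size versus shuffle cost; the read happens once per task rather than once per record, so it may be acceptable, but it does add an external dependency on the files being written first.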