Thank you Ted! That sounds like it would probably be the most efficient (with the least overhead) way of handling this situation.
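
For the archive, here is a rough sketch of how the lookup approach could be wired up; it is an illustration under assumptions, not a tested implementation. It assumes the keys run from 0 until numKeys, so that a partitioner can give every key its own partition, and it hides the store behind a hypothetical IndexStore trait (an HBase-backed version would fetch the row for the key instead of reading a local Map). SparseVec, InvIndex, KeyPartitioner, InMemoryStore and the placeholder similarity are stand-ins invented for this example, not the actual types from the thread.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the types mentioned in the thread.
case class SparseVec(indices: Array[Int], values: Array[Double])

case class InvIndex(postings: Map[Int, Double]) {
  // Placeholder similarity: weighted sum over the indices both sides share.
  def calculateSimilarity(v: SparseVec): Double =
    v.indices.zip(v.values).map { case (i, x) => x * postings.getOrElse(i, 0.0) }.sum
}

// Hypothetical store interface; must be serializable because it is
// captured by the closure that runs on the executors.
trait IndexStore extends Serializable {
  def lookup(key: Int): InvIndex
}

// In-memory implementation, used here only so the sketch runs locally.
case class InMemoryStore(data: Map[Int, InvIndex]) extends IndexStore {
  def lookup(key: Int): InvIndex = data(key)
}

// Assumes keys are 0 until numKeys, so each partition holds exactly one key.
class KeyPartitioner(numKeys: Int) extends Partitioner {
  def numPartitions: Int = numKeys
  def getPartition(key: Any): Int = Math.floorMod(key.asInstanceOf[Int], numKeys)
}

object IndexLookupSketch {
  def similarities(vectors: RDD[(Int, SparseVec)],
                   store: IndexStore,
                   numKeys: Int): RDD[(Int, Double)] =
    vectors
      .partitionBy(new KeyPartitioner(numKeys))
      .mapPartitions[(Int, Double)]({ iter =>
        if (!iter.hasNext) Iterator.empty
        else {
          val buffered = iter.buffered
          // One lookup per partition, keyed by the first element's key.
          val index = store.lookup(buffered.head._1)
          buffered.map { case (k, v) => (k, index.calculateSimilarity(v)) }
        }
      }, preservesPartitioning = true)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("index-lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val store = InMemoryStore(Map(
        0 -> InvIndex(Map(1 -> 0.5)),
        1 -> InvIndex(Map(2 -> 2.0))))
      val vectors = sc.parallelize(Seq(
        (0, SparseVec(Array(1), Array(4.0))),
        (1, SparseVec(Array(2), Array(3.0)))))
      similarities(vectors, store, numKeys = 2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The point of the design is that only the one index a partition needs is fetched, inside the task, rather than broadcasting every index to every executor. If several keys can land in the same partition, the single lookup would have to become a small per-key cache inside the mapPartitions closure.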

On Wed, Jan 13, 2016 at 11:36 AM Ted Yu <yuzhih...@gmail.com> wrote:

> Another approach is to store the objects in a NoSQL store such as HBase.
>
> Looking up an object should be very fast.
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> I'm looking for a way to send structures to pre-determined partitions so
>> that they can be used by another RDD in a mapPartitions.
>>
>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>> indexes. The inverted index objects are quite large.
>>
>> My hope is to do a mapPartitions within the RDD of vectors where I can
>> compare each vector to the inverted index. The issue is that I only NEED
>> one inverted index object per partition (which would have the same key as
>> the values within that partition).
>>
>>
>> val vectors: RDD[(Int, SparseVector)]
>>
>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>   a.reduceByKey(generateInvertedIndex)
>>
>> vectors.mapPartitions { iter =>
>>   // pseudocode: fetch the one index that shares this partition's key
>>   val invIndex = invertedIndexes(samePartitionKey)
>>   iter.map(invIndex.calculateSimilarity(_))
>> }
>>
>> How could I go about setting up the partitions such that the specific data
>> structure I need will be present for the mapPartitions, but without the
>> extra overhead of sending over all values (which would happen if I were to
>> make a broadcast variable)?
>>
>> One thought I have been having is to store the objects in HDFS, but I'm not
>> sure if that would be a suboptimal solution (it seems like it could slow
>> down the process a lot).
>>
>> Another thought I am currently exploring is whether there is some way I
>> can create a custom Partition or Partitioner that could hold the data
>> structure (although that might get too complicated and become problematic).
>>
>> Any thoughts on how I could attack this issue would be highly appreciated.
>>
>> Thank you for your help!