I'm looking for a way to send structures to pre-determined partitions so that
they can be used by another RDD in a mapPartitions.

Essentially I'm given an RDD of SparseVectors and an RDD of inverted
indexes. The inverted index objects are quite large.

My hope is to do a mapPartitions over the RDD of vectors where I can
compare each vector to the inverted index. The issue is that I only NEED one
inverted index object per partition (the one whose key matches the keys of
the values within that partition).


val vectors: RDD[(Int, SparseVector)]

val invertedIndexes: RDD[(Int, InvIndex)] =
  a.reduceByKey(generateInvertedIndex)

vectors.mapPartitions { iter =>
  // pseudocode: somehow obtain the single index for this partition's key
  val invIndex = invertedIndexes(samePartitionKey)
  iter.map(invIndex.calculateSimilarity(_))
}

How could I go about setting up the partitions such that the specific data
structure I need will be present for the mapPartitions, without the extra
overhead of sending all values to every executor (which is what would happen
if I made it a broadcast variable)?

One thought I have been having is to store the objects in HDFS, but I'm not
sure whether that would be a suboptimal solution (it seems like it could slow
down the process a lot).

Another thought I am currently exploring is whether there is some way I can
create a custom Partition or Partitioner that could hold the data structure
(although that might get too complicated and become problematic).
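For what it's worth, a custom Partitioner in Spark only has to supply `numPartitions` and `getPartition(key)`; a sketch of the logic (written here as a plain class rather than extending org.apache.spark.Partitioner, and with a hypothetical key set) could route each distinct Int key to its own partition. Holding the large index objects themselves inside the Partitioner seems risky, though, since the Partitioner is serialized along with tasks, so this sketch carries only the keys:

```scala
// Sketch: one partition per distinct key, so exactly one inverted index
// object would land alongside the vectors that share its key.
// (In Spark this would extend org.apache.spark.Partitioner.)
class KeyedIndexPartitioner(keys: Seq[Int]) {
  // Stable key -> partition-slot mapping.
  private val slot: Map[Int, Int] = keys.distinct.sorted.zipWithIndex.toMap
  def numPartitions: Int = slot.size
  def getPartition(key: Any): Int = slot(key.asInstanceOf[Int])
}

val p = new KeyedIndexPartitioner(Seq(5, 1, 2, 1))
println(p.numPartitions) // 3 distinct keys
println(p.getPartition(5)) // 2
```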

Any thoughts on how I could attack this issue would be highly appreciated.

thank you for your help!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
