Is just doing a join not an option? If both RDDs are partitioned with the
same partitioner, this can be pretty efficient (no extra shuffle at join
time, basically a map-side join).
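
A rough sketch of what a co-partitioned join could look like, not a drop-in solution. Vec and InvIndex are hypothetical stand-ins for the SparseVector and InvIndex types in the question, the dot-product similarity is only a placeholder, and numPartitions is an assumed tuning parameter. HashPartitioner, partitionBy, join and mapValues are the standard RDD API.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the SparseVector and InvIndex types in the question.
case class Vec(values: Map[Int, Double])
case class InvIndex(postings: Map[Int, Double]) {
  // Placeholder similarity: dot product over shared term ids.
  def calculateSimilarity(v: Vec): Double =
    v.values.iterator.map { case (term, w) => w * postings.getOrElse(term, 0.0) }.sum
}

object CoPartitionedJoin {
  def similarities(vectors: RDD[(Int, Vec)],
                   invertedIndexes: RDD[(Int, InvIndex)],
                   numPartitions: Int): RDD[(Int, Double)] = {
    val partitioner = new HashPartitioner(numPartitions)

    // Each RDD is shuffled once into the same layout; caching lets later
    // jobs reuse that layout instead of repartitioning again.
    val vecs = vectors.partitionBy(partitioner).cache()
    val idxs = invertedIndexes.partitionBy(partitioner).cache()

    // Both sides share the partitioner, so the join itself needs no further
    // shuffle: matching keys already sit in the same partition.
    vecs.join(idxs).mapValues { case (v, idx) => idx.calculateSimilarity(v) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("co-partitioned-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val vectors = sc.parallelize(Seq(
        1 -> Vec(Map(0 -> 1.0, 2 -> 2.0)),
        1 -> Vec(Map(2 -> 0.5)),
        2 -> Vec(Map(1 -> 3.0))))
      val indexes = sc.parallelize(Seq(
        1 -> InvIndex(Map(2 -> 4.0)),
        2 -> InvIndex(Map(1 -> 1.0))))
      similarities(vectors, indexes, 4).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With this layout each inverted index is sent only to the partition that owns its key, unlike a broadcast variable, which ships every index to every executor.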
On Jan 13, 2016 2:30 PM, "Daniel Imberman" <daniel.imber...@gmail.com>
wrote:

> I'm looking for a way to send structures to pre-determined partitions so
> that they can be used by another RDD in a mapPartitions call.
>
> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
> indexes. The inverted index objects are quite large.
>
> My hope is to do a mapPartitions within the RDD of vectors where I can
> compare each vector to the inverted index. The issue is that I only NEED
> one inverted index object per partition (which would have the same key as
> the values within that partition).
>
>
> val vectors: RDD[(Int, SparseVector)]
>
> val invertedIndexes: RDD[(Int, InvIndex)] =
>   a.reduceByKey(generateInvertedIndex)
>
> vectors.mapPartitions { iter =>
>   // Pseudocode: an RDD cannot be indexed like this; the intent is to get
>   // the one inverted index whose key matches this partition.
>   val invIndex = invertedIndexes(samePartitionKey)
>   iter.map(invIndex.calculateSimilarity(_))
> }
>
> How could I go about setting up the partitioning such that the specific data
> structure I need will be present for the mapPartitions, without the extra
> overhead of sending over all values (which would happen if I were to make
> a broadcast variable)?
>
> One thought I have been having is to store the objects in HDFS, but I'm not
> sure if that would be a suboptimal solution (it seems like it could slow
> down the process a lot).
>
> Another thought I am currently exploring is whether there is some way I can
> create a custom Partition or Partitioner that could hold the data structure
> (although that might get too complicated and become problematic).
>
> Any thoughts on how I could attack this issue would be highly appreciated.
>
> Thank you for your help!
>
>
>
>
>
>
