Thank you, Ted! That sounds like it would probably be the most efficient
(lowest-overhead) way of handling this situation.
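
For reference, here is a minimal sketch of the per-partition HBase lookup Ted
describes below, reusing the vectors, InvIndex, and calculateSimilarity names
from the original question quoted further down. It assumes the HBase 1.x client
API and that calculateSimilarity takes a SparseVector; the "inverted_indexes"
table name, the integer row-key encoding, and the deserializeInvIndex helper
are placeholders rather than anything from this thread.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

vectors.mapPartitions { iter =>
  // One HBase connection per partition, never one per record.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("inverted_indexes"))

  // Fetch each inverted index at most once per partition.
  val cache = scala.collection.mutable.Map.empty[Int, InvIndex]
  def lookup(key: Int): InvIndex = cache.getOrElseUpdate(key, {
    val result = table.get(new Get(Bytes.toBytes(key)))
    deserializeInvIndex(result.value())  // placeholder: bytes -> InvIndex
  })

  // Materialize this partition's results so the connection can be closed safely.
  val out = iter.map { case (key, vec) => lookup(key).calculateSimilarity(vec) }.toVector
  table.close()
  conn.close()
  out.iterator
}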

On Wed, Jan 13, 2016 at 11:36 AM Ted Yu <yuzhih...@gmail.com> wrote:

> Another approach is to store the objects in a NoSQL store such as HBase.
>
> Looking up an object should be very fast.
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> I'm looking for a way to send structures to pre-determined partitions so that
>> they can be used by another RDD in a mapPartitions.
>>
>> Essentially, I'm given an RDD of SparseVectors and an RDD of inverted
>> indexes. The inverted index objects are quite large.
>>
>> My hope is to do a mapPartitions over the RDD of vectors where I can compare
>> each vector to the inverted index. The issue is that I only NEED one inverted
>> index object per partition (the one with the same key as the values within
>> that partition).
>>
>>
>> val vectors: RDD[(Int, SparseVector)]
>>
>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>   a.reduceByKey(generateInvertedIndex)
>>
>> vectors.mapPartitions { iter =>
>>   // Pseudocode: I want the one inverted index whose key matches this partition.
>>   val invIndex = invertedIndexes(samePartitionKey)
>>   iter.map(invIndex.calculateSimilarity(_))
>> }
>>
>> How could I go about setting up the partitioning so that the specific data
>> structure I need will be present for the mapPartitions, without the extra
>> overhead of sending every inverted index to every node (which is what would
>> happen if I used a broadcast variable)?
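
One way to set that up, for what it's worth, is to co-partition the two RDDs
with the same Partitioner and walk them together with zipPartitions, so each
inverted index is shipped only to the one partition that needs it rather than
to every node. A rough sketch (the partition count is arbitrary, and
calculateSimilarity is again assumed to take a SparseVector):

import org.apache.spark.HashPartitioner

// Hash-partition both RDDs by key so matching keys land in the same partition index.
val partitioner  = new HashPartitioner(100)
val vecsByKey    = vectors.partitionBy(partitioner)
val indexesByKey = invertedIndexes.partitionBy(partitioner)

// zipPartitions pairs up partition i of each RDD, so each task only sees the
// inverted indexes whose keys hash into its own partition.
val similarities = vecsByKey.zipPartitions(indexesByKey) { (vecIter, idxIter) =>
  val localIndexes = idxIter.toMap  // small: only this partition's indexes
  vecIter.map { case (key, vec) => localIndexes(key).calculateSimilarity(vec) }
}
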
>>
>> One thought I have been having is to store the objects in HDFS, but I'm not
>> sure whether that would be a suboptimal solution (it seems like it could slow
>> down the process a lot).
>>
>> Another thought I am currently exploring is whether there is some way I can
>> create a custom Partition or Partitioner that could hold the data structure
>> (although that might get too complicated and become problematic).
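
On the custom Partitioner thought: the Partitioner itself probably shouldn't
hold the index (Partitioners get serialized into every task), but a small one
that routes each distinct key to its own partition would combine naturally
with the co-partitioning sketch above, in place of the HashPartitioner. A
hypothetical example:

import org.apache.spark.Partitioner

// Hypothetical KeyPartitioner: gives each distinct key its own partition, so
// "one inverted index object per partition" holds exactly. The key list could
// be collected once up front, e.g. invertedIndexes.keys.collect().
class KeyPartitioner(keys: Array[Int]) extends Partitioner {
  private val indexOf = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.length
  override def getPartition(key: Any): Int = indexOf(key.asInstanceOf[Int])
  // equals/hashCode let Spark recognize two RDDs partitioned by equal
  // KeyPartitioners as co-partitioned.
  override def equals(other: Any): Boolean = other match {
    case kp: KeyPartitioner => kp.indexOf == indexOf
    case _                  => false
  }
  override def hashCode: Int = indexOf.hashCode
}
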
>>
>> Any thoughts on how I could attack this issue would be highly appreciated.
>>
>> Thank you for your help!
>>
>
