Hi Koert,

So I actually just mentioned something similar in the thread (your email
came through right as I was sending mine :) ).

One question I have: if I do a groupByKey and I've been smart about my
partitioning up to that point, would I get the benefit of not needing to
shuffle the data? The only issue I have with a plain join (rather than
something like a groupByKey) is that I could end up with multiple copies of
the inverted index (or is Spark smart enough to store one value for the
InvIndex and simply have all associated values refer to the same object?).
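
To make the concern concrete (reusing the types from my original message
below), a plain join would pair every vector with its key's index, so at
the API level each vector carries its own copy of the large InvIndex
(whether they actually share one object in memory is exactly what I'm
unsure about):

val joined: RDD[(Int, (SparseVector, InvIndex))] =
  vectors.join(invertedIndexes)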

Best,

Daniel

On Sat, Jan 16, 2016 at 9:38 AM Koert Kuipers <ko...@tresata.com> wrote:

> Just doing a join is not an option? If you carefully manage your
> partitioning then this can be pretty efficient (meaning no extra shuffle;
> basically a map-side join).
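>
> Roughly like this (untested sketch; pick a partition count that suits
> your data):
>
> import org.apache.spark.HashPartitioner
>
> val p = new HashPartitioner(100) // partition count is up to you
> // partition both sides the same way once and cache the results;
> // joining two RDDs that share a partitioner needs no extra shuffle
> val vecs = vectors.partitionBy(p).persist()
> val idxs = invertedIndexes.partitionBy(p).persist()
> val joined = vecs.join(idxs)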
> On Jan 13, 2016 2:30 PM, "Daniel Imberman" <daniel.imber...@gmail.com>
> wrote:
>
>> I'm looking for a way to send structures to pre-determined partitions so
>> that they can be used by another RDD in a mapPartitions call.
>>
>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>> indexes. The inverted index objects are quite large.
>>
>> My hope is to do a mapPartitions over the RDD of vectors where I can
>> compare each vector to the inverted index. The issue is that I only NEED
>> one inverted index object per partition (which would have the same key as
>> the values within that partition).
>>
>>
>> val vectors: RDD[(Int, SparseVector)]
>>
>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>   a.reduceByKey(generateInvertedIndex)
>>
>> vectors.mapPartitions { iter =>
>>   // pseudocode: somehow get the one index whose key matches this partition
>>   val invIndex = invertedIndexes(samePartitionKey)
>>   iter.map(invIndex.calculateSimilarity(_))
>> }
>>
>> How could I go about setting up the partitioning such that the specific
>> data structure I need will be present for the mapPartitions, but without
>> the extra overhead of sending over all values (which would happen if I
>> made it a broadcast variable)?
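>>
>> For reference, the broadcast version I'm trying to avoid would look
>> roughly like this (it ships every inverted index to every executor):
>>
>> val allIndexes = sc.broadcast(invertedIndexes.collectAsMap())
>> vectors.map { case (k, v) =>
>>   (k, allIndexes.value(k).calculateSimilarity(v))
>> }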
>>
>> One thought I've had is to store the objects in HDFS, but I suspect that
>> would be suboptimal (it seems like it could slow down the process a lot).
>>
>> Another thought I'm currently exploring is whether there is some way to
>> create a custom Partition or Partitioner that could hold the data
>> structure (although that might get too complicated and become
>> problematic). See the sketch below.
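>>
>> Routing keys to fixed partitions seems straightforward; it's attaching
>> the data structure itself that looks messy. A bare key-routing
>> partitioner (name made up) would be something like:
>>
>> class KeyedPartitioner(val parts: Int) extends org.apache.spark.Partitioner {
>>   def numPartitions: Int = parts
>>   // send every record with Int key k to partition (k mod parts)
>>   def getPartition(key: Any): Int = {
>>     val k = key.asInstanceOf[Int] % parts
>>     if (k < 0) k + parts else k
>>   }
>> }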
>>
>> Any thoughts on how I could attack this issue would be highly appreciated.
>>
>> Thank you for your help!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
