Re: Sending large objects to specific RDDs

2016-01-17 Thread Daniel Imberman
This is perfect. So I guess my best course of action will be to create a
custom partitioner to ensure that the smallest amount of data is shuffled
when I join the partitions. Then I really only need to do a map (rather
than a mapPartitions), since the inverted index object will be referenced
(rather than copied for each value, as I had assumed).
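
For the archive, here is roughly what I'm planning (an untested sketch:
HashPartitioner stands in for the custom partitioner, and
partitionedVectors, invIndexes, Vector, and InvertedIndex are the names
from my snippets earlier in the thread):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Co-partition both RDDs once, up front, with the same partitioner.
val part = new HashPartitioner(64) // example count; tune to the cluster

val vecsByKey: RDD[(Int, Vector)] = partitionedVectors.partitionBy(part)
val idxByKey: RDD[(Int, InvertedIndex)] = invIndexes.partitionBy(part)

// Both sides now share a partitioner, so the join itself needs no further
// shuffle, and each pair within a partition just references the one
// InvertedIndex for its key.
val similarities = vecsByKey.join(idxByKey).map {
  case (_, (vec, invIndex)) => invIndex.calculateSimilarity(vec)
}
val maxSim = similarities.reduce(math.max)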

Thank you Ted and Koert!

On Sat, Jan 16, 2016 at 1:37 PM Ted Yu  wrote:

> Both groupByKey() and join() accept a Partitioner as a parameter.
>
> Maybe you can specify a custom Partitioner so that the amount of shuffle
> is reduced.
>
> On Sat, Jan 16, 2016 at 9:39 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> I think I might have figured something out! (Though I haven't tested it at
>> scale yet)
>>
>> My current thought is that I can do a groupByKey on the RDD of vectors
>> and then do a join with the invertedIndex.
>> It would look something like this:
>>
>> val invIndexes:RDD[(Int,InvertedIndex)]
>> val partitionedVectors:RDD[(Int, Vector)]
>>
>> val partitionedTasks:RDD[(Int, (Iterable[Vector], InvertedIndex))] =
>> partitionedVectors.groupByKey().join(invIndexes)
>>
>> val similarities = partitionedTasks.map(/* calculate similarities */)
>> val maxSim = similarities.reduce(math.max)
>>
>>
>> So while I realize that a groupByKey is usually frowned upon, it seems to
>> me that, since I need all associated vectors to be local anyway, this
>> repartitioning would not be too expensive.
>>
>> Does this seem like a reasonable approach to this problem, or are there
>> any faults that I should consider if I approach it this way?
>>
>> Thank you for your help,
>>
>> Daniel
>>
>> On Fri, Jan 15, 2016 at 5:30 PM Ted Yu  wrote:
>>
>>> My knowledge of XSEDE is limited - I visited the website.
>>>
>>> If there is no easy way to deploy HBase, an alternative approach (using
>>> HDFS?) needs to be considered.
>>>
>>> I need to do more homework on this :-)
>>>
>>> On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman <
>>> daniel.imber...@gmail.com> wrote:
>>>
 Hi Ted,

 So unfortunately after looking into the cluster manager that I will be
 using for my testing (I'm using a super-computer called XSEDE rather than
 AWS), it looks like the cluster does not actually come with HBase installed
 (this cluster is becoming somewhat problematic, as it is essentially AWS
 but you have to do your own virtualization scripts). Do you have any other
 thoughts on how I could go about dealing with this purely using Spark and
 HDFS?

 Thank you

 On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <
 daniel.imber...@gmail.com> wrote:

> Thank you Ted! That sounds like it would probably be the most
> efficient (with the least overhead) way of handling this situation.
>
> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu  wrote:
>
>> Another approach is to store the objects in a NoSQL store such as HBase.
>>
>> Looking up objects should be very fast.
>>
>> Cheers
>>
>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> I'm looking for a way to send structures to pre-determined
>>> partitions so that
>>> they can be used by another RDD in a mapPartition.
>>>
>>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>>> indexes. The inverted index objects are quite large.
>>>
>>> My hope is to do a MapPartitions within the RDD of vectors where I
>>> can
>>> compare each vector to the inverted index. The issue is that I only
>>> NEED one
>>> inverted index object per partition (which would have the same key
>>> as the
>>> values within that partition).
>>>
>>>
>>> val vectors:RDD[(Int, SparseVector)]
>>>
>>> val invertedIndexes:RDD[(Int, InvIndex)] =
>>> a.reduceByKey(generateInvertedIndex)
>>> vectors.mapPartitions{
>>> iter =>
>>>  val invIndex = invertedIndexes(samePartitionKey)
>>>  iter.map(invIndex.calculateSimilarity(_))
>>> }
>>>
>>> How could I go about setting up the Partition such that the specific
>>> data
>>> structure I need will be present for the mapPartition but I won't
>>> have the
>>> extra overhead of sending over all values (which would happen if I
>>> were to
>>> make a broadcast variable).
>>>
>>> One thought I have been having is to store the objects in HDFS but
>>> I'm not
>>> sure if that would be a suboptimal solution (It seems like it could
>>> slow
>>> down the process a lot)
>>>
>>> Another thought I am currently exploring is whether there is some
>>> way I can
>>> create a custom Partition or Partitioner that could hold the data
>>> structure
>>> (Although that might get too complicated and become problematic)

Re: Sending large objects to specific RDDs

2016-01-16 Thread Daniel Imberman
Hi Koert,

So I actually just mentioned something similar in the thread (your email
came through as I was sending mine :) ).

One question I have: if I do a groupByKey, and I have been smart about my
partitioning up to this point, would I get the benefit of not needing to
shuffle the data? The only issue I have with doing a join without something
like a groupByKey is that I could end up with multiple copies of the
inverted index (or is Spark smart enough to store one copy of the InvIndex
and simply have all associated values refer to the same object?).
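
For concreteness, the variant I'm weighing against a plain join is a
cogroup, which should hand me each key's index exactly once rather than
once per vector. An untested sketch, reusing the names from my original
snippet:

// cogroup yields (key, (all vectors for that key, all indexes for that key)),
// i.e. one InvIndex per key rather than one per (key, vector) pair.
val sims = vectors.cogroup(invertedIndexes).flatMap {
  case (_, (vecs, idxs)) =>
    idxs.headOption.toSeq.flatMap(idx => vecs.map(idx.calculateSimilarity))
}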

Best,

Daniel

On Sat, Jan 16, 2016 at 9:38 AM Koert Kuipers  wrote:

> Just doing a join is not an option? If you carefully manage your
> partitioning then this can be pretty efficient (meaning no extra shuffle,
> basically a map-side join).
> On Jan 13, 2016 2:30 PM, "Daniel Imberman" 
> wrote:
>
>> I'm looking for a way to send structures to pre-determined partitions so
>> that
>> they can be used by another RDD in a mapPartition.
>>
>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>> indexes. The inverted index objects are quite large.
>>
>> My hope is to do a MapPartitions within the RDD of vectors where I can
>> compare each vector to the inverted index. The issue is that I only NEED
>> one
>> inverted index object per partition (which would have the same key as the
>> values within that partition).
>>
>>
>> val vectors:RDD[(Int, SparseVector)]
>>
>> val invertedIndexes:RDD[(Int, InvIndex)] =
>> a.reduceByKey(generateInvertedIndex)
>> vectors.mapPartitions{
>> iter =>
>>  val invIndex = invertedIndexes(samePartitionKey)
>>  iter.map(invIndex.calculateSimilarity(_))
>> }
>>
>> How could I go about setting up the Partition such that the specific data
>> structure I need will be present for the mapPartition but I won't have the
>> extra overhead of sending over all values (which would happen if I were to
>> make a broadcast variable).
>>
>> One thought I have been having is to store the objects in HDFS but I'm not
>> sure if that would be a suboptimal solution (It seems like it could slow
>> down the process a lot)
>>
>> Another thought I am currently exploring is whether there is some way I
>> can
>> create a custom Partition or Partitioner that could hold the data
>> structure
>> (Although that might get too complicated and become problematic)
>>
>> Any thoughts on how I could attack this issue would be highly appreciated.
>>
>> thank you for your help!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Sending large objects to specific RDDs

2016-01-16 Thread Ted Yu
Both groupByKey() and join() accept a Partitioner as a parameter.

Maybe you can specify a custom Partitioner so that the amount of shuffle is
reduced.
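
Something along these lines (untested; the partition count is arbitrary and
the variable names follow your snippet below):

val p = new org.apache.spark.HashPartitioner(64)

// The same partitioner passed to both operations: groupByKey shuffles once,
// and the subsequent join can then run without re-shuffling the grouped side.
val grouped = partitionedVectors.groupByKey(p)
val partitionedTasks = grouped.join(invIndexes, p)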

On Sat, Jan 16, 2016 at 9:39 AM, Daniel Imberman 
wrote:

> Hi Ted,
>
> I think I might have figured something out! (Though I haven't tested it at
> scale yet)
>
> My current thought is that I can do a groupByKey on the RDD of vectors and
> then do a join with the invertedIndex.
> It would look something like this:
>
> val invIndexes:RDD[(Int,InvertedIndex)]
> val partitionedVectors:RDD[(Int, Vector)]
>
> val partitionedTasks:RDD[(Int, (Iterable[Vector], InvertedIndex))] =
> partitionedVectors.groupByKey().join(invIndexes)
>
> val similarities = partitionedTasks.map(/* calculate similarities */)
> val maxSim = similarities.reduce(math.max)
>
>
> So while I realize that a groupByKey is usually frowned upon, it seems to
> me that, since I need all associated vectors to be local anyway, this
> repartitioning would not be too expensive.
>
> Does this seem like a reasonable approach to this problem, or are there
> any faults that I should consider if I approach it this way?
>
> Thank you for your help,
>
> Daniel
>
> On Fri, Jan 15, 2016 at 5:30 PM Ted Yu  wrote:
>
>> My knowledge of XSEDE is limited - I visited the website.
>>
>> If there is no easy way to deploy HBase, an alternative approach (using
>> HDFS?) needs to be considered.
>>
>> I need to do more homework on this :-)
>>
>> On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> Hi Ted,
>>>
>>> So unfortunately after looking into the cluster manager that I will be
>>> using for my testing (I'm using a super-computer called XSEDE rather than
>>> AWS), it looks like the cluster does not actually come with HBase installed
>>> (this cluster is becoming somewhat problematic, as it is essentially AWS
>>> but you have to do your own virtualization scripts). Do you have any other
>>> thoughts on how I could go about dealing with this purely using Spark and
>>> HDFS?
>>>
>>> Thank you
>>>
>>> On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <
>>> daniel.imber...@gmail.com> wrote:
>>>
 Thank you Ted! That sounds like it would probably be the most efficient
 (with the least overhead) way of handling this situation.

 On Wed, Jan 13, 2016 at 11:36 AM Ted Yu  wrote:

> Another approach is to store the objects in a NoSQL store such as HBase.
>
> Looking up objects should be very fast.
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> I'm looking for a way to send structures to pre-determined partitions
>> so that
>> they can be used by another RDD in a mapPartition.
>>
>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>> indexes. The inverted index objects are quite large.
>>
>> My hope is to do a MapPartitions within the RDD of vectors where I can
>> compare each vector to the inverted index. The issue is that I only
>> NEED one
>> inverted index object per partition (which would have the same key as
>> the
>> values within that partition).
>>
>>
>> val vectors:RDD[(Int, SparseVector)]
>>
>> val invertedIndexes:RDD[(Int, InvIndex)] =
>> a.reduceByKey(generateInvertedIndex)
>> vectors.mapPartitions{
>> iter =>
>>  val invIndex = invertedIndexes(samePartitionKey)
>>  iter.map(invIndex.calculateSimilarity(_))
>> }
>>
>> How could I go about setting up the Partition such that the specific
>> data
>> structure I need will be present for the mapPartition but I won't
>> have the
>> extra overhead of sending over all values (which would happen if I
>> were to
>> make a broadcast variable).
>>
>> One thought I have been having is to store the objects in HDFS but
>> I'm not
>> sure if that would be a suboptimal solution (It seems like it could
>> slow
>> down the process a lot)
>>
>> Another thought I am currently exploring is whether there is some way
>> I can
>> create a custom Partition or Partitioner that could hold the data
>> structure
>> (Although that might get too complicated and become problematic)
>>
>> Any thoughts on how I could attack this issue would be highly
>> appreciated.
>>
>> thank you for your help!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org

Re: Sending large objects to specific RDDs

2016-01-15 Thread Ted Yu
My knowledge of XSEDE is limited - I visited the website.

If there is no easy way to deploy HBase, an alternative approach (using
HDFS?) needs to be considered.

I need to do more homework on this :-)

On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman 
wrote:

> Hi Ted,
>
> So unfortunately after looking into the cluster manager that I will be
> using for my testing (I'm using a super-computer called XSEDE rather than
> AWS), it looks like the cluster does not actually come with HBase installed
> (this cluster is becoming somewhat problematic, as it is essentially AWS
> but you have to do your own virtualization scripts). Do you have any other
> thoughts on how I could go about dealing with this purely using Spark and
> HDFS?
>
> Thank you
>
> On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Thank you Ted! That sounds like it would probably be the most efficient
>> (with the least overhead) way of handling this situation.
>>
>> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu  wrote:
>>
>>> Another approach is to store the objects in a NoSQL store such as HBase.
>>>
>>> Looking up objects should be very fast.
>>>
>>> Cheers
>>>
>>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
>>> daniel.imber...@gmail.com> wrote:
>>>
 I'm looking for a way to send structures to pre-determined partitions
 so that
 they can be used by another RDD in a mapPartition.

 Essentially I'm given an RDD of SparseVectors and an RDD of inverted
 indexes. The inverted index objects are quite large.

 My hope is to do a MapPartitions within the RDD of vectors where I can
 compare each vector to the inverted index. The issue is that I only
 NEED one
 inverted index object per partition (which would have the same key as
 the
 values within that partition).


 val vectors:RDD[(Int, SparseVector)]

 val invertedIndexes:RDD[(Int, InvIndex)] =
 a.reduceByKey(generateInvertedIndex)
 vectors.mapPartitions{
 iter =>
  val invIndex = invertedIndexes(samePartitionKey)
  iter.map(invIndex.calculateSimilarity(_))
 }

 How could I go about setting up the Partition such that the specific
 data
 structure I need will be present for the mapPartition but I won't have
 the
 extra overhead of sending over all values (which would happen if I were
 to
 make a broadcast variable).

 One thought I have been having is to store the objects in HDFS but I'm
 not
 sure if that would be a suboptimal solution (It seems like it could slow
 down the process a lot)

 Another thought I am currently exploring is whether there is some way I
 can
 create a custom Partition or Partitioner that could hold the data
 structure
 (Although that might get too complicated and become problematic)

 Any thoughts on how I could attack this issue would be highly
 appreciated.

 thank you for your help!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>


Re: Sending large objects to specific RDDs

2016-01-14 Thread Daniel Imberman
Hi Ted,

So unfortunately after looking into the cluster manager that I will be
using for my testing (I'm using a super-computer called XSEDE rather than
AWS), it looks like the cluster does not actually come with HBase installed
(this cluster is becoming somewhat problematic, as it is essentially AWS
but you have to do your own virtualization scripts). Do you have any other
thoughts on how I could go about dealing with this purely using Spark and
HDFS?
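
To make "purely Spark and HDFS" concrete, something like this is what I was
imagining (untested; the path is just an example, InvIndex comes from my
original snippet, and sc is the SparkContext):

// Persist the heavy index objects once...
invertedIndexes.saveAsObjectFile("hdfs:///user/daniel/inv-indexes")

// ...and read them back, still keyed by partition key, in the similarity job.
val loaded: RDD[(Int, InvIndex)] =
  sc.objectFile[(Int, InvIndex)]("hdfs:///user/daniel/inv-indexes")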

Thank you

On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman 
wrote:

> Thank you Ted! That sounds like it would probably be the most efficient
> (with the least overhead) way of handling this situation.
>
> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu  wrote:
>
>> Another approach is to store the objects in a NoSQL store such as HBase.
>>
>> Looking up objects should be very fast.
>>
>> Cheers
>>
>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> I'm looking for a way to send structures to pre-determined partitions so
>>> that
>>> they can be used by another RDD in a mapPartition.
>>>
>>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>>> indexes. The inverted index objects are quite large.
>>>
>>> My hope is to do a MapPartitions within the RDD of vectors where I can
>>> compare each vector to the inverted index. The issue is that I only NEED
>>> one
>>> inverted index object per partition (which would have the same key as the
>>> values within that partition).
>>>
>>>
>>> val vectors:RDD[(Int, SparseVector)]
>>>
>>> val invertedIndexes:RDD[(Int, InvIndex)] =
>>> a.reduceByKey(generateInvertedIndex)
>>> vectors.mapPartitions{
>>> iter =>
>>>  val invIndex = invertedIndexes(samePartitionKey)
>>>  iter.map(invIndex.calculateSimilarity(_))
>>> }
>>>
>>> How could I go about setting up the Partition such that the specific data
>>> structure I need will be present for the mapPartition but I won't have
>>> the
>>> extra overhead of sending over all values (which would happen if I were
>>> to
>>> make a broadcast variable).
>>>
>>> One thought I have been having is to store the objects in HDFS but I'm
>>> not
>>> sure if that would be a suboptimal solution (It seems like it could slow
>>> down the process a lot)
>>>
>>> Another thought I am currently exploring is whether there is some way I
>>> can
>>> create a custom Partition or Partitioner that could hold the data
>>> structure
>>> (Although that might get too complicated and become problematic)
>>>
>>> Any thoughts on how I could attack this issue would be highly
>>> appreciated.
>>>
>>> thank you for your help!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: Sending large objects to specific RDDs

2016-01-13 Thread Ted Yu
Another approach is to store the objects in a NoSQL store such as HBase.

Looking up objects should be very fast.
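
Rough idea (untested sketch; the table name, column family / qualifier, and
deserializeInvIndex helper are placeholders, and it assumes the row key
matches the partition index):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

vectors.mapPartitionsWithIndex { (pid, iter) =>
  // One connection per partition; fetch that partition's index by row key.
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("inv_index"))
  val bytes = table.get(new Get(Bytes.toBytes(pid)))
    .getValue(Bytes.toBytes("d"), Bytes.toBytes("v"))
  table.close(); conn.close()
  val invIndex = deserializeInvIndex(bytes) // placeholder deserializer
  iter.map(invIndex.calculateSimilarity)
}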

Cheers

On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman  wrote:

> I'm looking for a way to send structures to pre-determined partitions so
> that
> they can be used by another RDD in a mapPartition.
>
> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
> indexes. The inverted index objects are quite large.
>
> My hope is to do a MapPartitions within the RDD of vectors where I can
> compare each vector to the inverted index. The issue is that I only NEED
> one
> inverted index object per partition (which would have the same key as the
> values within that partition).
>
>
> val vectors:RDD[(Int, SparseVector)]
>
> val invertedIndexes:RDD[(Int, InvIndex)] =
> a.reduceByKey(generateInvertedIndex)
> vectors.mapPartitions{
> iter =>
>  val invIndex = invertedIndexes(samePartitionKey)
>  iter.map(invIndex.calculateSimilarity(_))
> }
>
> How could I go about setting up the Partition such that the specific data
> structure I need will be present for the mapPartition but I won't have the
> extra overhead of sending over all values (which would happen if I were to
> make a broadcast variable).
>
> One thought I have been having is to store the objects in HDFS but I'm not
> sure if that would be a suboptimal solution (It seems like it could slow
> down the process a lot)
>
> Another thought I am currently exploring is whether there is some way I can
> create a custom Partition or Partitioner that could hold the data structure
> (Although that might get too complicated and become problematic)
>
> Any thoughts on how I could attack this issue would be highly appreciated.
>
> thank you for your help!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
Thank you Ted! That sounds like it would probably be the most efficient
(with the least overhead) way of handling this situation.

On Wed, Jan 13, 2016 at 11:36 AM Ted Yu  wrote:

> Another approach is to store the objects in a NoSQL store such as HBase.
>
> Looking up objects should be very fast.
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> I'm looking for a way to send structures to pre-determined partitions so
>> that
>> they can be used by another RDD in a mapPartition.
>>
>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>> indexes. The inverted index objects are quite large.
>>
>> My hope is to do a MapPartitions within the RDD of vectors where I can
>> compare each vector to the inverted index. The issue is that I only NEED
>> one
>> inverted index object per partition (which would have the same key as the
>> values within that partition).
>>
>>
>> val vectors:RDD[(Int, SparseVector)]
>>
>> val invertedIndexes:RDD[(Int, InvIndex)] =
>> a.reduceByKey(generateInvertedIndex)
>> vectors.mapPartitions{
>> iter =>
>>  val invIndex = invertedIndexes(samePartitionKey)
>>  iter.map(invIndex.calculateSimilarity(_))
>> }
>>
>> How could I go about setting up the Partition such that the specific data
>> structure I need will be present for the mapPartition but I won't have the
>> extra overhead of sending over all values (which would happen if I were to
>> make a broadcast variable).
>>
>> One thought I have been having is to store the objects in HDFS but I'm not
>> sure if that would be a suboptimal solution (It seems like it could slow
>> down the process a lot)
>>
>> Another thought I am currently exploring is whether there is some way I
>> can
>> create a custom Partition or Partitioner that could hold the data
>> structure
>> (Although that might get too complicated and become problematic)
>>
>> Any thoughts on how I could attack this issue would be highly appreciated.
>>
>> thank you for your help!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>