We are using Cassandra for a similar kind of problem and it works well... You
need to take care of the race condition between updating the store and looking
it up...
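
A minimal sketch of how the update could be guarded, assuming the DataStax
Java driver (keyspace, table, and column names here are hypothetical):

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    val session = cluster.connect("my_keyspace")

    // Lightweight transaction (compare-and-set): the update only applies if
    // the row is still at the version we read, so a concurrent writer's
    // change cannot be silently overwritten.
    val rs = session.execute(
      "UPDATE lookup_table SET value = ?, version = ? WHERE key = ? IF version = ?",
      newValue, Int.box(newVersion), key, Int.box(oldVersion))
    if (!rs.one().getBool("[applied]")) {
      // Lost the race: re-read the current row and retry the refinement.
    }
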
On Aug 29, 2015 1:31 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:

> +1 on Jason's suggestion.
>
> bq. this large variable is broadcast many times during the lifetime
>
> Please consider making this large variable more granular. That is, reduce
> the amount of data transferred between the key-value store and your app
> during each update.
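>
> A tiny sketch of the idea (the diff helper and store API are hypothetical,
> just to illustrate the shape of a delta update):
>
>     // Push only the entries that changed this iteration, not the full table.
>     val delta: Map[String, Array[Byte]] = changedEntries(oldTable, newTable)
>     delta.foreach { case (k, v) => store.put(k, v) }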
>
> Cheers
>
> On Fri, Aug 28, 2015 at 12:44 PM, Jason <ja...@jasonknight.us> wrote:
>
>> You could try using an external key-value store (like HBase or Redis) and
>> performing lookups/updates inside your mappers (you'd need to create the
>> connection within a mapPartitions code block to avoid per-record connection
>> setup/teardown overhead)?
>>
>> I haven't done this myself though, so I'm just throwing the idea out
>> there.
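>>
>> Roughly something like this (untested; using the Jedis Redis client and a
>> made-up Entry.key field just for illustration):
>>
>>     import redis.clients.jedis.Jedis
>>
>>     val enriched = entries.mapPartitions { iter =>
>>       // One connection per partition instead of one per record
>>       val jedis = new Jedis("redis-host")
>>       // Materialize before closing, since the iterator is evaluated lazily
>>       val out = iter.map(e => (e, jedis.get(e.key))).toList
>>       jedis.close()
>>       out.iterator
>>     }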
>>
>> On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff <j...@atware.co.jp> wrote:
>>
>>> Hi,
>>>
>>> I am working on a Spark application that is using a large (~3 GB)
>>> broadcast variable as a lookup table. The application refines the data in
>>> this lookup table in an iterative manner, so this large variable is
>>> broadcast many times during the lifetime of the application process.
>>>
>>> From what I have observed, perhaps 60% of the execution time is spent
>>> waiting for the variable to broadcast in each iteration. My reading of a
>>> Spark performance article[1] suggests that the time spent broadcasting will
>>> increase with the number of nodes I add.
>>>
>>> My question for the group - what would you suggest as an alternative to
>>> broadcasting a large variable like this?
>>>
>>> One approach I have considered is segmenting my RDD and adding a copy of
>>> the lookup table for every X values to process. So, for example, if I have
>>> a list of 1 million entries to process (e.g., RDD[Entry]), I could split
>>> this into segments of 100K entries, pair each segment with a copy of the
>>> lookup table, and make that an RDD[(Lookup, Array[Entry])].
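>>>
>>> A rough sketch of how I might build that pairing (names and the 100K
>>> segment size are placeholders; the idea is to pay the shuffle cost once,
>>> cache the result, and reuse it across iterations instead of
>>> re-broadcasting):
>>>
>>>     val segmented: RDD[(Lookup, Array[Entry])] = entries
>>>       .zipWithIndex()
>>>       .map { case (e, i) => (i / 100000, e) } // ~100K entries per segment
>>>       .groupByKey()
>>>       .map { case (_, es) => (lookupTable, es.toArray) }
>>>       .cache()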
>>>
>>> Another solution I am looking at is making the lookup table an RDD
>>> instead of a broadcast variable. Perhaps I could use an IndexedRDD[2] to
>>> improve performance. One issue with this approach is that I would have to
>>> rewrite my application code to use two RDDs so that I do not reference the
>>> lookup RDD from within the closure of another RDD.
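>>>
>>> In sketch form, the two-RDD version I have in mind looks something like
>>> this (refine and the key field are placeholders):
>>>
>>>     import org.apache.spark.HashPartitioner
>>>
>>>     // Partition and cache the lookup table once, so each iteration joins
>>>     // against co-located data instead of re-broadcasting ~3 GB.
>>>     val lookupRdd = rawLookup.partitionBy(new HashPartitioner(64)).cache()
>>>
>>>     val keyed  = entries.map(e => (e.key, e))
>>>     val result = keyed.join(lookupRdd).map { case (_, (e, v)) => refine(e, v) }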
>>>
>>> Any other recommendations?
>>>
>>> Jeff
>>>
>>>
>>> [1]
>>> http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf
>>>
>>> [2] https://github.com/amplab/spark-indexedrdd
>>>
>>
>
