Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Raghavendra Pandey
We are using Cassandra for a similar kind of problem and it works well... You need to take care of the race condition between updating the store and looking up the store... On Aug 29, 2015 1:31 AM, Ted Yu yuzhih...@gmail.com wrote: +1 on Jason's suggestion. bq. this large variable is broadcast many

Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Hemminger Jeff
Thanks for the recommendations. I had been focused on solving the problem within Spark but a distributed database sounds like a better solution. Jeff On Sat, Aug 29, 2015 at 11:47 PM, Ted Yu yuzhih...@gmail.com wrote: Not sure if the race condition you mentioned is related to Cassandra's data

Alternative to Large Broadcast Variables

2015-08-28 Thread Hemminger Jeff
Hi, I am working on a Spark application that is using a large (~3G) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner. So this large variable is broadcast many times during the lifetime of the application process. From what I

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Ted Yu
+1 on Jason's suggestion. bq. this large variable is broadcast many times during the lifetime Please consider making this large variable more granular. Meaning, reduce the amount of data transferred between the key value store and your app during update. Cheers On Fri, Aug 28, 2015 at 12:44
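
One possible reading of "more granular" is to send only the entries that changed in an iteration rather than the whole table. The sketch below is an illustration under that assumption, not something from the thread: the function name changed_entries and the sample dictionaries are hypothetical, and plain Python dicts stand in for the lookup table.

```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel so that a stored value of None is still compared correctly


def changed_entries(old: Dict[str, Any], new: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (upserts, deletes) needed to turn `old` into `new`.

    Hypothetical helper: upserts holds keys that were added or whose value
    changed; deletes lists keys that are no longer present.
    """
    upserts = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    deletes = [k for k in old if k not in new]
    return upserts, deletes


# Made-up sample data: only "b" changed, "d" was added, "c" was removed.
old_table = {"a": 1, "b": 2, "c": 3}
new_table = {"a": 1, "b": 5, "d": 4}
upserts, deletes = changed_entries(old_table, new_table)
```

With this shape, only upserts and deletes would need to be written to the store after each refinement step, so the unchanged bulk of the table is never re-sent.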

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Jason
You could try using an external key value store (like HBase, Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead)? I haven't done this myself though, so I'm just throwing
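
A minimal sketch of the per-partition connection pattern described above, in Python. Everything named here is an assumption for illustration: InMemoryStore is a hypothetical stand-in for a real HBase or Redis client, and lookup_partition is a made-up helper. The partitions are simulated with plain lists so the example runs without Spark.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


class InMemoryStore:
    """Hypothetical stand-in for a key-value store client; counts connections opened."""

    opened = 0

    def __init__(self, data: Dict[str, str]) -> None:
        InMemoryStore.opened += 1
        self._data = data
        self.closed = False

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def close(self) -> None:
        self.closed = True


def lookup_partition(
    keys: Iterable[str], connect: Callable[[], InMemoryStore]
) -> Iterator[Tuple[str, Optional[str]]]:
    """Open one connection for the whole partition, look up each key, then close it."""
    store = connect()
    try:
        for key in keys:
            yield key, store.get(key)
    finally:
        store.close()


# Local simulation of two partitions. With PySpark the call would be roughly
# rdd.mapPartitions(lambda keys: lookup_partition(keys, connect)).
table = {"a": "1", "b": "2"}
connect = lambda: InMemoryStore(table)
partitions: List[List[str]] = [["a", "b"], ["b", "c"]]
results = [list(lookup_partition(p, connect)) for p in partitions]
```

The point of the design is that the connection cost is paid once per partition instead of once per record; here two partitions open exactly two connections regardless of how many keys each one holds.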