Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
What's your data format? ORC, CSV, or something else?
    // assumes uniq_key is a string column
    val keys = sqlContext.read.orc("your previous batch data path").select($"uniq_key").collect().map(_.getString(0)).toSet
    val broadcastKeys = sc.broadcast(keys)
    val rdd = your_current_batch_data
    rdd.filter(line => !broadcastKeys.value.contains(line.key))
> On Dec 8, 2015, at 4:44

Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
I'm running a Spark batch job in cluster mode every hour, and it runs for 15 minutes. I have certain unique keys in the dataset. I don't want to process those keys again during the next hour's batch. *Thanks*, On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu

Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Pipe-separated values. I know that broadcast and join work, but I would like to know whether MapDB works or not. *Thanks*, On Tue, Dec 8, 2015 at 2:22 PM, Fengdong Yu wrote: > > what’s your data format? ORC or CSV or others? > > val keys

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you describe your question in more detail? What do your previous batch and the current batch look like? > On Dec 8, 2015, at 3:52 PM, Ramkumar V wrote: > > Hi, > > I'm running java over spark in cluster mode. I want to apply filter on > javaRDD based on some previous batch

Re: Spark with MapDB

2015-12-08 Thread Jörn Franke
You may want to use a Bloom filter for this, but make sure that you understand how it works. > On 08 Dec 2015, at 09:44, Ramkumar V wrote: > > Im running spark batch job in cluster mode every hour and it runs for 15 > minutes. I have certain unique keys in the dataset.
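For readers unfamiliar with the suggestion, here is a minimal sketch of a Bloom filter in plain Java. It is illustrative only and is not taken from the thread: the class name `SimpleBloomFilter`, the bit-array size, the number of hash functions, and the FNV-1a based hashing are all assumptions chosen for brevity. The property to understand is that a key that was added is always reported as present, while a key that was never added may occasionally be reported as present too (a false positive), so a "skip already-seen keys" filter built on it could drop a small number of genuinely new keys.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch for string keys (illustrative, not tuned). */
class SimpleBloomFilter implements Serializable {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // FNV-1a style 32-bit hash with a seed mixed into the initial state.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int index(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    /** false means definitely never added; true means probably added. */
    boolean mightContain(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter seen = new SimpleBloomFilter(1 << 16, 3);
        seen.add("key-1");
        seen.add("key-2");
        // Keys from the previous batch are always reported as seen.
        System.out.println(seen.mightContain("key-1"));
        // An unseen key is usually, but not always, reported as absent.
        System.out.println(seen.mightContain("key-999"));
    }
}
```

If dropping a few new keys by mistake is not acceptable, an exact set of keys is the safer choice, at the cost of more memory.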

Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Yes, I agree, but the data is in the form of an RDD and I'm running in cluster mode, so the data is distributed across all machines in the cluster. A Bloom filter or MapDB, on the other hand, is not distributed. How would it work in this case? *Thanks*,
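One way to think about this question: a Spark broadcast variable serializes a value on the driver and ships a read-only copy to every executor, so a small, serializable, in-memory structure does not need to be distributed itself. The sketch below imitates that round trip in plain Java, without Spark, using standard Java serialization. The class and method names (`SeenKeysRoundTrip`, `serialize`, `deserialize`, `dropSeen`) are hypothetical and exist only for illustration. A MapDB file on one machine's local disk would not get this behavior automatically, since only the object that is serialized travels to the workers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Imitates shipping a key set from a driver to a worker (no Spark needed). */
class SeenKeysRoundTrip {

    // "Driver side": turn the set of previously seen keys into bytes.
    static byte[] serialize(HashSet<String> keys) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(keys);
        }
        return buf.toByteArray();
    }

    // "Worker side": rebuild a local copy of the set from the bytes.
    @SuppressWarnings("unchecked")
    static HashSet<String> deserialize(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (HashSet<String>) in.readObject();
        }
    }

    // Keep only the keys of the current batch that were not seen before.
    static List<String> dropSeen(List<String> currentKeys, Set<String> seen) {
        return currentKeys.stream()
                .filter(k -> !seen.contains(k))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        HashSet<String> seen = new HashSet<>(Arrays.asList("k1", "k2"));
        byte[] shipped = serialize(seen);            // what would travel over the network
        HashSet<String> onWorker = deserialize(shipped);
        System.out.println(dropSeen(Arrays.asList("k1", "k3", "k2", "k4"), onWorker));
        // prints [k3, k4]
    }
}
```

The same reasoning applies to a Bloom filter object, provided it is serializable and small enough to fit in each executor's memory.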

Spark with MapDB

2015-12-07 Thread Ramkumar V
Hi, I'm running Java on Spark in cluster mode. I want to apply a filter to a JavaRDD based on values from a previous batch. If I store those values in MapDB, is it possible to apply the filter during the current batch? *Thanks*,