Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu

What's your data format? ORC, CSV, or something else?

// collect the previous batch's keys on the driver, then broadcast them
val keys = sqlContext.read.orc("your previous batch data path")
  .select($"uniq_key")
  .collect()
  .map(_.getString(0))
  .toSet
val broadcastKeys = sc.broadcast(keys)

// keep only records whose key was not seen in the previous batch
val rdd = your_current_batch_data
rdd.filter(line => !broadcastKeys.value.contains(line.key))
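
To make this work hour over hour, each run also has to write out its own
keys for the next run to read. A minimal sketch, assuming a DataFrame named
current with the same uniq_key column (both names are placeholders):

// persist this batch's keys so the next hourly run can load them
current.select($"uniq_key")
  .write
  .mode("overwrite")
  .orc("your current batch keys path")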


Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
I'm running a Spark batch job in cluster mode every hour, and each run takes
about 15 minutes. The dataset contains certain unique keys, and I don't want
to process those keys again during the next hour's batch.

Thanks,


Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Pipe-separated values. I know broadcast and join work, but I would like to
know whether MapDB works or not.
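
For reference, the join-style alternative can also be written without a
broadcast, e.g. with RDD.subtractByKey. A sketch, where previousRdd and
currentRdd are placeholders for record RDDs exposing a key field:

// pair each record with its key
val previousKeyed = previousRdd.map(r => (r.key, ()))
val currentKeyed  = currentRdd.map(r => (r.key, r))

// anti-join: keep only current records whose key is absent from the previous batch
val newRecords = currentKeyed.subtractByKey(previousKeyed).values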

Thanks,


Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you detail your question? What do your previous batch and the current
batch look like?


Re: Spark with MapDB

2015-12-08 Thread Jörn Franke
You may want to use a Bloom filter for this, but make sure that you understand
how it works: it can return false positives, so it will also drop a small
fraction of keys that were never actually seen.
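
A minimal sketch of that idea with Guava's BloomFilter, which is Serializable
and can therefore be broadcast; previousKeys (the collected previous-batch
keys) and keyOf (a key extractor) are placeholders:

import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

// build the filter on the driver from the previous batch's keys
val bloom = BloomFilter.create(
  Funnels.stringFunnel(StandardCharsets.UTF_8),
  previousKeys.length, // expected number of insertions
  0.001)               // tolerated false-positive probability
previousKeys.foreach(k => bloom.put(k))

// broadcast it once, then filter the current batch against it
val bloomBc = sc.broadcast(bloom)
val filtered = rdd.filter(line => !bloomBc.value.mightContain(keyOf(line)))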


Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Yes, I agree, but the data is in the form of an RDD, and I'm running in
cluster mode, so the data is distributed across all the machines in the
cluster. A Bloom filter or a MapDB store, however, is not distributed. How
will that work in this case?
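
One way to bridge that gap is to build the MapDB file on the driver, ship it
to every executor with sc.addFile, and open the local copy inside
mapPartitions (SparkFiles.get resolves the shipped path). A rough sketch,
assuming the MapDB 1.x API (DBMaker.newFileDB, getHashSet); previousKeys and
keyOf are placeholders:

import java.io.File
import org.apache.spark.SparkFiles
import org.mapdb.DBMaker

// driver side: write the previous batch's keys into a local MapDB file
val dbFile = new File("/tmp/previous_keys.mapdb")
val db = DBMaker.newFileDB(dbFile).make()
val keySet = db.getHashSet[String]("keys")
previousKeys.foreach(k => keySet.add(k))
db.commit()
db.close()

// MapDB 1.x can create companion files (.p, .t) next to the main file;
// ship whichever of them exist so they land in the same executor directory
Seq("", ".p", ".t")
  .map(suffix => new File(dbFile.getPath + suffix))
  .filter(_.exists())
  .foreach(f => sc.addFile(f.getAbsolutePath))

// executor side: open the shipped copy read-only, once per partition
val filtered = rdd.mapPartitions { iter =>
  val localDb = DBMaker
    .newFileDB(new File(SparkFiles.get(dbFile.getName)))
    .readOnly()
    .closeOnJvmShutdown() // the handle outlives the iterator, so close at shutdown
    .make()
  val seen = localDb.getHashSet[String]("keys")
  iter.filter(line => !seen.contains(keyOf(line)))
}

Whether this beats a plain broadcast set depends mostly on how many keys
there are; for a set that fits comfortably in memory, the broadcast approach
earlier in the thread is simpler.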

Thanks,


Spark with MapDB

2015-12-07 Thread Ramkumar V
Hi,

I'm running Java on Spark in cluster mode. I want to apply a filter on a
JavaRDD based on some values from a previous batch. If I store those values
in MapDB, is it possible to apply that filter during the current batch?

Thanks,