If it is indeed a reactive use case, then Spark Streaming would be a good 
choice. 

One approach worth considering: is it possible to receive the changes as 
messages via Kafka (or some other queue)? That would not need any polling, and 
you could use standard consumers. If polling isn't an issue, then writing a 
custom receiver will work fine. The way a receiver works is this:

* Your receiver has a receive() function, where you'd typically start a loop. 
  In your loop, you'd fetch items and call store(entry).
* You control everything in the receiver. If you're listening on a queue, you 
  receive messages, store() them, and ack your queue. If you're polling, it's 
  up to you to ensure delays between DB calls.
* The things you store() go on to make up the RDDs in your DStream, so 
  intervals, windowing, etc. apply to those. The receiver is the boundary 
  between your data source and the DStream RDDs. In other words, if your 
  interval is 15 seconds with no windowing, then the things that went to 
  store() in each 15-second interval are bunched up into an RDD of your 
  DStream. That's a simplification, but it should give you the idea that your 
  "DB polling" interval and streaming interval are not tied together.

-Ashic.
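
[Editor's sketch] The point that the polling interval and the batch interval 
are independent can be illustrated with a small plain-Python simulation. This 
is not Spark's API: PollingReceiver, drain() and run_batches() are hypothetical 
names made up for the illustration, and a real receiver would extend Spark's 
Receiver class in Scala or Java. The sketch only assumes a fetch callable that 
returns whatever new items the source has.

```python
import queue
import threading
import time


class PollingReceiver:
    """Toy stand-in for a streaming receiver (hypothetical, not Spark's API)."""

    def __init__(self, fetch, poll_interval):
        self._fetch = fetch                # callable returning a list of new items
        self._poll_interval = poll_interval
        self._buffer = queue.Queue()       # items handed over via store()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._receive, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped.set()
        self._thread.join()

    def store(self, item):
        self._buffer.put(item)

    def _receive(self):
        # The receiver owns this loop, including the delay between source calls.
        while not self._stopped.is_set():
            for item in self._fetch():
                self.store(item)
            self._stopped.wait(self._poll_interval)

    def drain(self):
        """Return everything stored since the previous drain (one 'batch')."""
        batch = []
        while True:
            try:
                batch.append(self._buffer.get_nowait())
            except queue.Empty:
                return batch


def run_batches(receiver, batch_interval, num_batches):
    """Cut the stored items into batches on a schedule independent of polling."""
    batches = []
    receiver.start()
    try:
        for _ in range(num_batches):
            time.sleep(batch_interval)
            batches.append(receiver.drain())
    finally:
        receiver.stop()
    return batches


if __name__ == "__main__":
    counter = iter(range(1_000_000))
    fetch = lambda: [next(counter)]        # pretend each poll finds one new row
    for i, batch in enumerate(run_batches(PollingReceiver(fetch, 0.05), 0.2, 3)):
        print(f"batch {i}: {batch}")
```

Here the source is polled every 0.05 seconds while batches are cut every 0.2 
seconds, so each batch holds several polls' worth of items. Changing either 
number does not require changing the other.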

Date: Mon, 6 Jul 2015 01:12:34 +1000
Subject: Re: JDBC Streams
From: guha.a...@gmail.com
To: as...@live.com
CC: ak...@sigmoidanalytics.com; user@spark.apache.org

Hi

Thanks for the reply. Here is my situation: I have a DB which enables 
synchronous CDC. Think of this as a DB trigger which writes the "changed" 
values to a table as soon as something changes in the production table. My job 
will need to pick up the data "as soon as it arrives", which can be at 1 minute 
intervals. Ideally it will pick up the changes, transform them into JSON, and 
put them to Kinesis. In short, I am emulating a Kinesis producer with a DB 
source (don't even ask why, let's say these are the constraints :) )
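
[Editor's sketch] The "pick up the changes, transform to JSON, send on" step 
described above might look roughly like the following. Everything specific here 
is an assumption made for illustration: the table name changes, its columns 
(id, table_name, payload), and the use of sqlite3 as a stand-in for the real 
database. The sink argument is a placeholder callable for whatever actually 
writes to Kinesis; no Kinesis client is used.

```python
import json
import sqlite3


def fetch_changes(conn, last_seen_id, limit=500):
    """Return change rows with id > last_seen_id, oldest first."""
    cur = conn.execute(
        "SELECT id, table_name, payload FROM changes "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, limit),
    )
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def forward_changes(conn, last_seen_id, sink):
    """Send each new change to sink as JSON; return the new high-water mark."""
    for change in fetch_changes(conn, last_seen_id):
        sink(json.dumps(change, sort_keys=True))
        last_seen_id = change["id"]   # advance only after the sink call returned
    return last_seen_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE changes "
        "(id INTEGER PRIMARY KEY, table_name TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO changes (table_name, payload) VALUES (?, ?)",
        [("orders", "status=paid"), ("orders", "status=shipped")],
    )
    sent = []                          # stands in for the Kinesis stream
    mark = forward_changes(conn, 0, sent.append)
    print(mark, sent)
```

The returned high-water mark would need to be persisted between runs. If the 
process dies after a change is sent but before the mark is saved, that change 
is sent again on restart, so this gives at-least-once delivery at best.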

Please advise: (a) is Spark a good choice here, and (b) what's your suggestion 
either way?

I understand I can easily do it using a simple Java/Python app, but I am a 
little worried about managing scaling/fault tolerance, and that's where my 
concern is.

TIA
Ayan

On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab <as...@live.com> wrote:



Hi Ayan,

How "continuous" is your workload? As Akhil points out, with streaming you'll 
give up at least one core for receiving, and will need at least one more core 
for processing. Unless you're running on something like Mesos, this means that 
those cores are dedicated to your app and can't be leveraged by other apps / 
jobs.

If it's something periodic (once an hour, once every 15 minutes, etc.), then 
I'd simply write a "normal" Spark application and trigger it periodically. 
There are many things that can take care of that. Sometimes a simple cron job 
is enough!

Date: Sun, 5 Jul 2015 22:48:37 +1000
Subject: Re: JDBC Streams
From: guha.a...@gmail.com
To: ak...@sigmoidanalytics.com
CC: user@spark.apache.org

Thanks Akhil. In case I go with Spark Streaming, I guess I have to implement a 
custom receiver, and Spark Streaming will call this receiver every batch 
interval. Is that correct? Any gotchas you see in this plan?

TIA...
Best,
Ayan

On Sun, Jul 5, 2015 at 5:40 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
If you want a long-running application, then go with Spark Streaming (which 
kind of blocks your resources). On the other hand, if you use the job server, 
then you can actually use the resources (CPUs) for other jobs when your DB job 
is not using them.

Thanks
Best Regards

On Sun, Jul 5, 2015 at 5:28 AM, ayan guha <guha.a...@gmail.com> wrote:
Hi All

I have a requirement to connect to a DB every few minutes and bring data to 
HBase. Can anyone suggest whether Spark Streaming would be appropriate for this 
scenario, or should I look into jobserver?

Thanks in advance
-- 
Best Regards,
Ayan Guha





