Hi Flavio,

When your streaming job starts, somewhere in the cluster the Receiver will
be started in its own thread/process.  You can do whatever you like within
the receiver, e.g. start and manage your own thread pool to fetch external
data and feed Spark.  If your Receiver dies, Spark will re-provision it on
another worker and start it again.

A very simple way to control the flow rate would be to add a parameter to
your Receiver, e.g. rateLimit (messages per second); within your Receiver
code you then use that limit to control the rate at which you make
external calls.

When you receive the data you can just call store(message) and not worry
about any intermediate buffering; simply tune rateLimit to match the
rate/latency at which your job processes batches.  Take a look at the
other implementations like KafkaInputDStream
<https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala>
to see how they manage it using several worker threads.

My suggestion would be to knock up a basic custom receiver and give it a
shot!
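
Something along these lines, as an untested sketch (fetchMessage() is a
stand-in for your external-service call, not a real API):

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class RateLimitedReceiver(rateLimit: Int)  // messages per second
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart() {
      // Run the fetch loop on its own thread so onStart() returns quickly
      new Thread("Rate-Limited Receiver") {
        override def run() { receive() }
      }.start()
    }

    def onStop() { }  // receive() checks isStopped() and exits on its own

    private def receive() {
      val intervalMs = 1000L / rateLimit
      while (!isStopped()) {
        store(fetchMessage())      // hand each message straight to Spark
        Thread.sleep(intervalMs)   // crude pacing: at most rateLimit msgs/sec
      }
    }

    private def fetchMessage(): String = ???  // your external call goes here
  }

You would then plug it in with ssc.receiverStream(new
RateLimitedReceiver(100)).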

MC


On 19 June 2014 09:31, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Hi Michael,
> thanks for the tip, it's really an elegant solution.
> What I'm still missing here (maybe I should take a look at the code of
> TwitterInputDStream
> <https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala>...)
> is how to limit the external-service call rate and manage the incoming
> buffer size (enqueuing).
> Could you give me some tips on that?
>
> Thanks again,
> Flavio
>
>
> On Thu, Jun 19, 2014 at 10:19 AM, Michael Cutler <mich...@tumra.com>
> wrote:
>
>> Hello Flavio,
>>
>> It sounds to me like the best solution for you is to implement your own
>> ReceiverInputDStream/Receiver component to feed Spark Streaming with
>> DStreams.  It is not as scary as it sounds; take a look at some of the
>> examples like TwitterInputDStream
>> <https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala>,
>> which is only ~60 lines of code.
>>
>> The key snippet to look for is:
>>
>>   def onStart() {
>>     try {
>>       val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
>>       newTwitterStream.addListener(new StatusListener {
>>         def onStatus(status: Status) = {
>>           store(status)
>>         }
>>       ...
>>
>> This is the interface between twitter4j (the external library which
>> receives tweets) and Spark: you call store(), which passes the object
>> down to Spark to be batched into a DStream and fed into Spark Streaming.
>> There is some excellent documentation on implementing Custom Receivers
>> here: http://spark.apache.org/docs/latest/streaming-custom-receivers.html
>>
>> Using your own custom receiver means you can implement flow-control to
>> ingest data at sensible rates, as well as failover/retry logic etc.
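>>
>> As a rough illustration (MyReceiver and callExternalService are
>> placeholders, not Spark API), wiring a custom receiver into a streaming
>> job looks something like this:
>>
>>   import org.apache.spark.SparkConf
>>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>>
>>   val conf = new SparkConf().setAppName("CustomReceiverExample")
>>   val ssc  = new StreamingContext(conf, Seconds(5))
>>
>>   // MyReceiver is whatever Receiver[T] subclass you implement
>>   val events = ssc.receiverStream(new MyReceiver())
>>
>>   // Each batch, call your external service for every received event
>>   events.foreachRDD { rdd => rdd.foreach(e => callExternalService(e)) }
>>
>>   ssc.start()
>>   ssc.awaitTermination()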
>>
>> Best of luck!
>>
>> MC
>>
>>
>> On 19 June 2014 07:50, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> Yes, I need to call the external service for every event, and the order
>>> does not matter.
>>> There's no time limit in which each event must be processed. I can't
>>> tell the producer to slow down, nor can I drop events.
>>> Of course I could put a message broker in between, like an AMQP or JMS
>>> broker, but I was thinking that maybe this issue was already addressed
>>> in some way (of course there should be some buffer to handle a
>>> high-rate stream)... or not?
>>>
>>>
>>>
>>> On Thu, Jun 19, 2014 at 4:48 AM, Soumya Simanta <
>>> soumya.sima...@gmail.com> wrote:
>>>
>>>> Flavio - I'm new to Spark as well, but I've done stream processing
>>>> using other frameworks. My comments below are not Spark
>>>> Streaming-specific. Maybe someone who knows more can provide better
>>>> insights.
>>>>
>>>> I read your post on my phone, and I believe my answer doesn't
>>>> completely address the issue you have raised.
>>>>
>>>> Do you need to call the external service for every event, i.e., do you
>>>> need to process all events? Also, does the order of processing events
>>>> matter? Is there a time bound in which each event should be processed?
>>>>
>>>> Calling an external service means network IO, so you have to buffer
>>>> events if your service is rate-limited or slower than the rate at
>>>> which you are receiving events.
>>>>
>>>> Here are some ways of dealing with this situation:
>>>>
>>>> 1. Drop events based on a policy (such as buffer/queue size).
>>>> 2. Tell the event producer to slow down, if that's in your control.
>>>> 3. Use a proxy or a set of proxies to distribute the calls to the
>>>> remote service, if the rate limit is per user or per network node only.
>>>>
>>>> I'm not sure how many of these are implemented directly in Spark
>>>> Streaming, but you can have an external component that controls the
>>>> rate of events and only sends events to Spark Streaming when it's
>>>> ready to process more messages.
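>>>>
>>>> For example, both the drop policy (1) and back pressure can be built
>>>> on a bounded queue; here's a plain Scala sketch, not tied to Spark,
>>>> with illustrative names:
>>>>
>>>>   import java.util.concurrent.ArrayBlockingQueue
>>>>
>>>>   class BoundedBuffer[T](capacity: Int) {
>>>>     private val queue = new ArrayBlockingQueue[T](capacity)
>>>>
>>>>     // Back pressure: the producer blocks while the buffer is full
>>>>     def putBlocking(event: T): Unit = queue.put(event)
>>>>
>>>>     // Drop policy: returns false (event dropped) when the buffer is full
>>>>     def putOrDrop(event: T): Boolean = queue.offer(event)
>>>>
>>>>     // Consumer side: blocks until an event is available
>>>>     def take(): T = queue.take()
>>>>   }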
>>>>
>>>> Hope this helps.
>>>>
>>>> -Soumya
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier <
>>>> pomperma...@okkam.it> wrote:
>>>>
>>>>> Thanks for the quick reply soumya. Unfortunately I'm a newbie with
>>>>> Spark..what do you mean? is there any reference to how to do that?
>>>>>
>>>>>
>>>>> On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta <
>>>>> soumya.sima...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> You can add a back-pressure-enabled component in front that feeds
>>>>>> data into Spark. This component can control the input rate to Spark.
>>>>>>
>>>>>> > On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier <
>>>>>> pomperma...@okkam.it> wrote:
>>>>>> >
>>>>>> > Hi to all,
>>>>>> > in my use case I'd like to receive events and call an external
>>>>>> service as they pass through. Is it possible to limit the number of
>>>>>> concurrent calls to that service (to avoid DoS) using Spark
>>>>>> Streaming? If so, limiting the rate implies possible buffer growth...
>>>>>> how can I control the buffer of incoming events waiting to be
>>>>>> processed?
>>>>>> >
>>>>>> > Best,
>>>>>> > Flavio
>>>>>>
>>>>>
>>>
>
