OK, I'll start from that when I try to implement it.
Thanks again for the great support!

Best,
Flavio


On Thu, Jun 19, 2014 at 10:57 AM, Michael Cutler <mich...@tumra.com> wrote:

> Hi Flavio,
>
> When your streaming job starts, the Receiver will be started somewhere in
> the cluster in its own thread/process.  You can do whatever you like within
> the receiver, e.g. start and manage your own thread pool to fetch external
> data and feed Spark.  If your Receiver dies, Spark will re-provision it on
> another worker and start it again.
>
> A very simple way to control the flow rate would be to add a parameter to
> your Receiver called rateLimit (messages per second), for example; within
> your Receiver code you would use the limit to control the rate at which
> you make external calls.
>
> When you receive the data you can just call store(message) and not worry
> about any intermediate buffering; simply tune rateLimit to match the
> rate/latency at which your job processes batches.  Take a look at other
> implementations like KafkaInputDStream
> <https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala>
> to see how they manage it using several worker threads.
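>
> A minimal sketch of what such a rate-limited receiver might look like
> (RateLimitedReceiver and fetchMessage() are illustrative names, not part of
> Spark -- swap in your own external client code):
>
>   import org.apache.spark.storage.StorageLevel
>   import org.apache.spark.streaming.receiver.Receiver
>
>   // Hypothetical receiver that throttles calls to an external service.
>   class RateLimitedReceiver(rateLimit: Int)
>     extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
>
>     def onStart() {
>       // Run the polling loop in its own thread so onStart() returns quickly.
>       new Thread("Rate Limited Receiver") {
>         override def run() { receive() }
>       }.start()
>     }
>
>     def onStop() {
>       // The receive() loop checks isStopped(), so nothing extra to do here.
>     }
>
>     private def receive() {
>       val delayMs = 1000L / rateLimit  // spread calls evenly over each second
>       while (!isStopped()) {
>         val message = fetchMessage()   // your external service call goes here
>         store(message)                 // hand the record to Spark for batching
>         Thread.sleep(delayMs)
>       }
>     }
>
>     // Placeholder for the external call -- replace with your client code.
>     private def fetchMessage(): String = "..."
>   }
>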
>
> My suggestion would be to knock up a basic custom receiver and give it a
> shot!
>
> MC
>
>
> On 19 June 2014 09:31, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
>> Hi Michael,
>> thanks for the tip, it's really an elegant solution.
>> What I'm still missing here (maybe I should take a look at the code of
>> TwitterInputDStream
>> <https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala>...)
>> is how to limit the external service call rate and manage the size of the
>> incoming buffer (enqueuing).
>> Could you give me some tips for that?
>>
>> Thanks again,
>> Flavio
>>
>>
>> On Thu, Jun 19, 2014 at 10:19 AM, Michael Cutler <mich...@tumra.com>
>> wrote:
>>
>>> Hello Flavio,
>>>
>>> It sounds to me like the best solution for you is to implement your own
>>> ReceiverInputDStream/Receiver component to feed Spark Streaming with
>>> DStreams.  It is not as scary as it sounds; take a look at some of the
>>> examples like TwitterInputDStream
>>> <https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala>,
>>> which is only ~60 lines of code.
>>>
>>> The key snippet to look for is:
>>>
>>>   def onStart() {
>>>     try {
>>>       val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
>>>       newTwitterStream.addListener(new StatusListener {
>>>         def onStatus(status: Status) = {
>>>           store(status)
>>>         }
>>>       ...
>>>
>>>
>>> This is the interface between twitter4j (the external library which
>>> receives tweets) and Spark: you call store(), which passes the object down
>>> to Spark to be batched into a DStream and fed into Spark Streaming.
>>> There is some excellent documentation on implementing custom receivers here:
>>> http://spark.apache.org/docs/latest/streaming-custom-receivers.html
>>>
>>> Using your own custom receiver means you can implement flow-control to
>>> ingest data at sensible rates, as well as failover/retry logic etc.
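>>>
>>> Once your receiver class is written, wiring it into a job is a one-liner on
>>> the StreamingContext -- here MyReceiver and its rateLimit parameter are just
>>> placeholders for whatever you implement:
>>>
>>>   import org.apache.spark.SparkConf
>>>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>
>>>   val conf = new SparkConf().setAppName("CustomReceiverExample")
>>>   val ssc = new StreamingContext(conf, Seconds(5))
>>>
>>>   // receiverStream() turns any Receiver implementation into a DStream.
>>>   val events = ssc.receiverStream(new MyReceiver(rateLimit = 100))
>>>   events.foreachRDD(rdd => rdd.foreach(println))
>>>
>>>   ssc.start()
>>>   ssc.awaitTermination()
>>>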
>>>
>>> Best of luck!
>>>
>>> MC
>>>
>>>
>>>
>>> On 19 June 2014 07:50, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>
>>>> Yes, I need to call the external service for every event, and the order
>>>> does not matter.
>>>> There's no time limit in which each event should be processed. I can't
>>>> tell the producer to slow down, nor can I drop events.
>>>> Of course I could put a message broker in between, like an AMQP or JMS
>>>> broker, but I was thinking that maybe this issue was already addressed in
>>>> some way (of course there should be some buffer to handle a high-rate
>>>> stream)... or not?
>>>>
>>>>
>>>>
>>>> On Thu, Jun 19, 2014 at 4:48 AM, Soumya Simanta <
>>>> soumya.sima...@gmail.com> wrote:
>>>>
>>>>> Flavio - I'm new to Spark as well, but I've done stream processing
>>>>> using other frameworks. My comments below are not Spark Streaming
>>>>> specific. Maybe someone who knows more can provide better insights.
>>>>>
>>>>> I read your post on my phone and I believe my answer doesn't
>>>>> completely address the issue you have raised.
>>>>>
>>>>> Do you need to call the external service for every event? I.e., do you
>>>>> need to process all events? Also, does the order of processing events
>>>>> matter? Is there a time bound in which each event should be processed?
>>>>>
>>>>> Calling an external service means network IO, so you have to buffer
>>>>> events if your service is rate-limited or slower than the rate at which
>>>>> you are processing your events.
>>>>>
>>>>> Here are some ways of dealing with this situation:
>>>>>
>>>>> 1. Drop events based on a policy, such as buffer/queue size (see the
>>>>> sketch below).
>>>>> 2. Tell the event producer to slow down, if that's in your control.
>>>>> 3. Use a proxy or a set of proxies to distribute the calls to the
>>>>> remote service, if the rate limit is per user or per network node only.
>>>>>
>>>>> I'm not sure how many of these are implemented directly in Spark
>>>>> Streaming, but you can have an external component that controls the rate
>>>>> of events and only sends events to Spark Streaming when it is ready to
>>>>> process more messages.
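>>>>>
>>>>> As a rough illustration of option 1 (not Spark-specific code), such a
>>>>> component could sit on a bounded queue and simply drop events once it
>>>>> fills up -- BoundedEventBuffer is just a made-up name here:
>>>>>
>>>>>   import java.util.concurrent.ArrayBlockingQueue
>>>>>
>>>>>   // Bounded buffer between a fast producer and a slower consumer.
>>>>>   class BoundedEventBuffer[T](capacity: Int) {
>>>>>     private val queue = new ArrayBlockingQueue[T](capacity)
>>>>>
>>>>>     // Returns false (event dropped) when the buffer is already full.
>>>>>     def offer(event: T): Boolean = queue.offer(event)
>>>>>
>>>>>     // Blocks until an event is available; the consumer thread (e.g. the
>>>>>     // one feeding Spark) drains at whatever rate it can sustain.
>>>>>     def take(): T = queue.take()
>>>>>   }
>>>>>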
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> -Soumya
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier <
>>>>> pomperma...@okkam.it> wrote:
>>>>>
>>>>>> Thanks for the quick reply, Soumya. Unfortunately I'm a newbie with
>>>>>> Spark... what do you mean? Is there any reference on how to do that?
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta <
>>>>>> soumya.sima...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> You can add a back-pressure-enabled component in front that feeds
>>>>>>> data into Spark. This component can control the input rate to Spark.
>>>>>>>
>>>>>>> > On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier <
>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>> >
>>>>>>> > Hi to all,
>>>>>>> > in my use case I'd like to receive events and call an external
>>>>>>> service as they pass through. Is it possible to limit the number of
>>>>>>> concurrent calls to that service (to avoid a DoS) using Spark Streaming?
>>>>>>> If so, limiting the rate implies possible buffer growth... how can I
>>>>>>> control the buffer of incoming events waiting to be processed?
>>>>>>> >
>>>>>>> > Best,
>>>>>>> > Flavio
>>>>>>>
>>>>>>
>>>>
>>
