OK, I'll start from that when I try to implement it. Thanks again for the great support!
Best,
Flavio

On Thu, Jun 19, 2014 at 10:57 AM, Michael Cutler <mich...@tumra.com> wrote:

> Hi Flavio,
>
> When your streaming job starts, the Receiver will be started in its own
> thread/process somewhere in the cluster. You can do whatever you like
> within the receiver, e.g. start and manage your own thread pool to fetch
> external data and feed Spark. If your Receiver dies, Spark will
> re-provision it on another worker and start it again.
>
> A very simple way to control the flow rate would be to add a parameter to
> your Receiver called rateLimit (messages per second, for example); within
> your Receiver code you would use that limit to control the rate at which
> you make external calls.
>
> When you receive the data you can just call store(message) and not worry
> about any intermediate buffering; you can simply tune rateLimit to match
> the rate/latency at which your job is processing batches. Take a look at
> the other implementations like KafkaInputDStream
> <https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala>
> to see how they manage it using several worker threads.
>
> My suggestion would be to knock up a basic custom receiver and give it a
> shot!
>
> MC
>
>
> On 19 June 2014 09:31, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
>> Hi Michael,
>> thanks for the tip, it's really an elegant solution.
>> What I'm still missing here (maybe I should take a look at the code of
>> TwitterInputDStream
>> <https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala>)
>> is how to limit the external service call rate and manage the incoming
>> buffer size (enqueuing).
>> Could you give me some tips for that?
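The rateLimit idea suggested above can be sketched as a tiny standalone limiter; the class and names below are illustrative, not part of any Spark API, and a simple sleep-based scheme is assumed to be sufficient:

```scala
// Minimal fixed-rate limiter: allows at most `ratePerSecond` calls per
// second. Each receiver worker thread would call acquire() before making
// an external service call.
class RateLimiter(ratePerSecond: Int) {
  private val intervalNanos = 1000000000L / ratePerSecond
  private var nextFreeAt = System.nanoTime() // earliest time of next call

  def acquire(): Unit = synchronized {
    val now = System.nanoTime()
    if (nextFreeAt > now) {
      // Too early: sleep until the next slot opens up.
      Thread.sleep((nextFreeAt - now) / 1000000L)
    }
    // Advance the schedule by one interval, never moving it backwards.
    nextFreeAt = math.max(nextFreeAt, System.nanoTime()) + intervalNanos
  }
}
```

Wrapping every external call in `limiter.acquire()` keeps the aggregate call rate at or below `ratePerSecond`, regardless of how fast events arrive.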
>>
>> Thanks again,
>> Flavio
>>
>>
>> On Thu, Jun 19, 2014 at 10:19 AM, Michael Cutler <mich...@tumra.com>
>> wrote:
>>
>>> Hello Flavio,
>>>
>>> It sounds to me like the best solution for you is to implement your own
>>> ReceiverInputDStream/Receiver component to feed Spark Streaming with
>>> DStreams. It is not as scary as it sounds; take a look at some of the
>>> examples like TwitterInputDStream
>>> <https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala>,
>>> which are ~60 lines of code.
>>>
>>> The key snippet to look for is:
>>>
>>>   def onStart() {
>>>     try {
>>>       val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
>>>       newTwitterStream.addListener(new StatusListener {
>>>         def onStatus(status: Status) = {
>>>           store(status)
>>>         }
>>>         ...
>>>
>>> This is the interface between twitter4j (the external library which
>>> receives tweets) and Spark: basically you call store(), which passes the
>>> object down to Spark to be batched into a DStream and fed into Spark
>>> Streaming. There is some excellent documentation on implementing custom
>>> receivers here:
>>> http://spark.apache.org/docs/latest/streaming-custom-receivers.html
>>>
>>> Using your own custom receiver means you can implement flow control to
>>> ingest data at sensible rates, as well as failover/retry logic etc.
>>>
>>> Best of luck!
>>>
>>> MC
>>>
>>> *Michael Cutler*
>>> Founder, CTO
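Putting the two suggestions together, a custom receiver along these lines might look as follows. This is only a sketch against the Spark 1.x Receiver API described in the custom-receivers guide linked above; `fetchMessage` is a hypothetical stand-in for the real external service call, and the sleep-based pacing is one of many possible flow-control schemes:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch of a rate-limited custom receiver. `fetchMessage` is a
// hypothetical stand-in for whatever external call produces messages.
class RateLimitedReceiver(rateLimit: Int, fetchMessage: () => String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a dedicated thread so onStart() returns immediately.
    new Thread("rate-limited-receiver") {
      override def run(): Unit = {
        val intervalMs = 1000L / rateLimit
        while (!isStopped()) {
          store(fetchMessage()) // hand the message to Spark for batching
          Thread.sleep(intervalMs) // crude flow control: rateLimit msgs/sec
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // The receive loop polls isStopped(), so nothing extra to do here.
  }
}
```

If the receiver dies, Spark restarts it on another worker, as noted earlier in the thread, so the loop needs no retry logic of its own for that case.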
>>>
>>> On 19 June 2014 07:50, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>
>>>> Yes, I need to call the external service for every event, and the
>>>> order does not matter.
>>>> There's no time limit within which each event should be processed. I
>>>> can't tell the producer to slow down, nor drop events.
>>>> Of course I could put a message broker in between, like an AMQP or JMS
>>>> broker, but I was thinking that maybe this issue was already addressed
>>>> in some way (of course there should be some buffer to process a
>>>> high-rate stream)... or not?
>>>>
>>>>
>>>> On Thu, Jun 19, 2014 at 4:48 AM, Soumya Simanta
>>>> <soumya.sima...@gmail.com> wrote:
>>>>
>>>>> Flavio - I'm new to Spark as well, but I've done stream processing
>>>>> using other frameworks. My comments below are not Spark Streaming
>>>>> specific. Maybe someone who knows more can provide better insights.
>>>>>
>>>>> I read your post on my phone and I believe my answer doesn't
>>>>> completely address the issue you have raised.
>>>>>
>>>>> Do you need to call the external service for every event? I.e., do
>>>>> you need to process all events? Also, does the order of processing
>>>>> events matter? Is there a time bound within which each event should
>>>>> be processed?
>>>>>
>>>>> Calling an external service means network IO, so you have to buffer
>>>>> events if your service is rate limited or slower than the rate at
>>>>> which you are processing your events.
>>>>>
>>>>> Here are some ways of dealing with this situation:
>>>>>
>>>>> 1. Drop events based on a policy (such as buffer/queue size).
>>>>> 2. Tell the event producer to slow down, if that's in your control.
>>>>> 3.
Use a proxy or a set of proxies to distribute the calls to the
>>>>> remote service, if the rate limit is per user or per network node.
>>>>>
>>>>> I'm not sure how many of these are implemented directly in Spark
>>>>> Streaming, but you can have an external component that controls the
>>>>> rate of events and only sends events to Spark Streaming when it's
>>>>> ready to process more messages.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> -Soumya
>>>>>
>>>>>
>>>>> On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier
>>>>> <pomperma...@okkam.it> wrote:
>>>>>
>>>>>> Thanks for the quick reply, Soumya. Unfortunately I'm a newbie with
>>>>>> Spark... what do you mean? Is there any reference on how to do that?
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta
>>>>>> <soumya.sima...@gmail.com> wrote:
>>>>>>
>>>>>>> You can add a back-pressure-enabled component in front that feeds
>>>>>>> data into Spark. This component can control the input rate to
>>>>>>> Spark.
>>>>>>>
>>>>>>> > On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier
>>>>>>> <pomperma...@okkam.it> wrote:
>>>>>>> >
>>>>>>> > Hi to all,
>>>>>>> > in my use case I'd like to receive events and call an external
>>>>>>> service as they pass through. Is it possible to limit the number of
>>>>>>> simultaneous calls to that service (to avoid DoS) using Spark
>>>>>>> Streaming? If so, limiting the rate implies possible buffer
>>>>>>> growth... how can I control the buffer of incoming events waiting
>>>>>>> to be processed?
>>>>>>> >
>>>>>>> > Best,
>>>>>>> > Flavio
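The "back-pressure-enabled component in front" that Soumya describes, and the drop policy from option 1 above, can both be approximated with a bounded queue sitting between the event producer and the receiver thread. A minimal sketch, with illustrative names (the class and its drop policy are not from the thread or from Spark):

```scala
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.atomic.AtomicLong

// Bounded buffer between a fast event producer and the receiver thread.
// With dropWhenFull = false, publish() blocks the producer when the
// buffer is full (back pressure); with dropWhenFull = true, excess
// events are counted and discarded (a drop policy).
class BoundedBuffer[T](capacity: Int, dropWhenFull: Boolean) {
  private val queue = new ArrayBlockingQueue[T](capacity)
  private val dropCounter = new AtomicLong(0)

  def publish(event: T): Unit =
    if (dropWhenFull) {
      if (!queue.offer(event)) dropCounter.incrementAndGet() // full: drop
    } else {
      queue.put(event) // full: block until the consumer catches up
    }

  // Called by the receiver thread, which would then call store().
  def take(): T = queue.take()

  def dropped: Long = dropCounter.get()
}
```

The receiver thread drains the buffer with `take()` and hands each event to Spark, while the producer's rate is governed by the buffer's capacity rather than by Spark itself.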