I didn’t reply to the last part of your message: my source is Kafka, and Kafka already acts as a buffer with plenty of space.
So when I start my Spark job there is a lot of data to catch up on (and it is critical not to lose any), but the Kafka consumer goes as fast as it can, and that is faster than my HDFS’s maximum throughput. I think the kind of self-regulating system you describe would be too difficult to implement and probably unreliable (even more so given that we have multiple slaves).

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Monday, August 11, 2014 11:44
To: Gwenhael Pasquiers
Subject: Re: [spark-streaming] kafka source and flow control

Hi,

On Mon, Aug 11, 2014 at 6:19 PM, gpasquiers <gwenhael.pasqui...@ericsson.com> wrote:

> I’m using spark-streaming in a cloudera environment to consume a kafka
> source and store all data into hdfs.

I assume you are doing something else in between? If not, maybe a software such as Apache Flume may be better suited?

> I have complete control over the kafka consumer since I developed a custom
> Receiver as a workaround to:
> https://issues.apache.org/jira/browse/SPARK-2103
> but I’d like to make the flow control more intelligent than a simple rate
> limit (x messages or bytes per second). I’m interested in all ideas or
> suggestions.

I think what I would try to do is measure how much data I can process within one time window (i.e., keep track of processing speed) and then (continuously?) adapt the data rate to something I am capable of processing. In that case you would have to make sure that the data doesn’t instead get lost within Kafka. After all, the problem seems to be that your HDFS is too slow and you’ll have to buffer that data *somewhere*, right?

Tobias
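For what it’s worth, the window-based adaptation Tobias describes can be sketched very simply: measure the throughput the HDFS sink actually achieved during the last batch window, then cap the next window’s consumption at that measured rate times a safety margin. The class and method names below are hypothetical, purely for illustration; this is not Spark or Kafka API, just a minimal sketch of the feedback loop a custom Receiver could apply.

```java
// Hypothetical sketch of per-window adaptive rate limiting for a receiver.
// After each batch window completes, feed back how many records the sink
// really absorbed; the limiter then targets that observed capacity minus
// a safety margin, so the sink is never driven at 100% of its throughput.
public class AdaptiveRateLimiter {
    private final double safetyFactor;   // e.g. 0.8 = use 80% of observed capacity
    private double currentRatePerSec;

    public AdaptiveRateLimiter(long initialRatePerSec, double safetyFactor) {
        this.currentRatePerSec = initialRatePerSec;
        this.safetyFactor = safetyFactor;
    }

    /** Records per second the receiver may consume during the next window. */
    public long allowedRatePerSec() {
        return (long) currentRatePerSec;
    }

    /** Call once per batch window with what the downstream sink really processed. */
    public void onWindowCompleted(long recordsProcessed, long windowMillis) {
        double measuredRatePerSec = recordsProcessed * 1000.0 / windowMillis;
        // Track the sink's observed capacity, keeping the safety margin,
        // and never drop below 1 record/sec so the pipeline cannot stall.
        currentRatePerSec = Math.max(1.0, measuredRatePerSec * safetyFactor);
    }
}
```

Usage would look like: start at some optimistic rate, and after each window report what was actually written, e.g. `limiter.onWindowCompleted(5000, 1000)` after a 1-second window that flushed 5 000 records caps the next window at 4 000 records/sec with a 0.8 safety factor. Since Kafka retains the backlog, throttling the receiver this way only delays the catch-up, it does not lose data (as long as Kafka’s retention covers the lag).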