I was hoping I could make the system behave like a blocking queue: if the 
output is too slow, the buffers (storage space for RDDs) fill up and then 
block, instead of dropping existing RDDs, until the input itself blocks 
(slows down its consumption).
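
To illustrate the behavior I have in mind (plain Scala, nothing Spark-specific; the buffer size and the sleep are made up):

import java.util.concurrent.ArrayBlockingQueue

object BlockingQueueSketch {
  def main(args: Array[String]): Unit = {
    // Bounded buffer standing in for the RDD storage space.
    val buffer = new ArrayBlockingQueue[String](100)

    val producer = new Thread(new Runnable {
      def run(): Unit = {
        var i = 0
        while (true) {
          buffer.put("record-" + i) // blocks once the buffer is full,
          i += 1                    // so the input naturally slows down
        }
      }
    })

    val consumer = new Thread(new Runnable {
      def run(): Unit = {
        while (true) {
          val record = buffer.take() // blocks while the buffer is empty
          Thread.sleep(50)           // simulate a slow output stage
          println(record)
        }
      }
    })

    producer.start()
    consumer.start()
  }
}

No record is ever dropped; the producer simply waits until the consumer catches up.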

On a side note I was wondering: is there the same issue with file (HDFS) 
inputs? How can I be sure the input won’t “overflow” the processing chain?


From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Tuesday, August 12, 2014 02:58
To: Gwenhael Pasquiers
Cc: u...@spark.incubator.apache.org
Subject: Re: [spark-streaming] kafka source and flow control

Hi,

On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers 
<gwenhael.pasqui...@ericsson.com> wrote:
We intend to apply other operations on the data later in the same Spark 
context, but our first step is to archive it.

Our goal is something like this (sketched below):
Step 1: consume Kafka
Step 2: archive to HDFS AND send to step 3
Step 3: transform data
Step 4: save transformed data to HDFS as input for M/R
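
Roughly, in Spark Streaming code (just a sketch: the ZooKeeper address, topic name, HDFS paths and the transformation are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ArchiveAndTransform {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("archive-and-transform")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Step 1: consume Kafka; keep only the message value
    val raw = KafkaUtils.createStream(
      ssc, "zk-host:2181", "archive-group", Map("events" -> 1)
    ).map(_._2)

    // Step 2: archive the raw stream to HDFS ...
    raw.saveAsTextFiles("hdfs:///archive/raw/events")

    // Step 3: ... and transform the same stream (placeholder logic)
    val transformed = raw.map(_.toUpperCase)

    // Step 4: save transformed data to HDFS as input for M/R
    transformed.saveAsTextFiles("hdfs:///staging/transformed/events")

    ssc.start()
    ssc.awaitTermination()
  }
}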

I see. Well, I think Spark Streaming may be well suited for that purpose.

To us it looks like a major flaw if, in streaming mode, Spark Streaming cannot 
slow down its consumption depending on the available resources.

On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers 
<gwenhael.pasqui...@ericsson.com> wrote:
I think the kind of self-regulating system you describe would be too difficult 
to implement and probably unreliable (even more so given that we have multiple 
slaves).

Isn't "slow down its consumption depending on the available resources" a 
"self-regulating system"? I don't see how you can adapt to available resources 
without measuring your execution time and then change how much you consume. Did 
you have any particular form of adaption in mind?
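
For example, the simplest feedback loop I can think of looks like this (plain Scala; consume() and processBatch() stand in for the real Kafka read and Spark job, and the constants are arbitrary):

object RateAdaptationSketch {
  // Placeholders for the real Kafka consumption and processing.
  def consume(n: Int): Seq[String] = Seq.fill(n)("record")
  def processBatch(batch: Seq[String]): Unit = Thread.sleep(batch.size / 10)

  def main(args: Array[String]): Unit = {
    val batchIntervalMs = 1000L
    var maxRecords = 1000

    while (true) {
      val start = System.currentTimeMillis()
      processBatch(consume(maxRecords))
      val elapsedMs = System.currentTimeMillis() - start

      // Measure execution time, then change how much we consume:
      // back off when we exceed the batch interval, ramp up otherwise.
      maxRecords =
        if (elapsedMs > batchIntervalMs) math.max(100, (maxRecords * 0.8).toInt)
        else (maxRecords * 1.1).toInt
    }
  }
}

Some loop of this shape seems unavoidable to me if the system is to adapt at all.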

Tobias
