Hi,

We intend to apply other operations to the data later in the same Spark
context, but our first step is to archive it.

Our goal is something like this (sketched below):
Step 1: consume Kafka
Step 2: archive to HDFS AND send to step 3
Step 3: transform data
Step 4: save transformed data to HDFS as input for M/R
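
A rough, untested sketch of that pipeline, assuming the KafkaUtils receiver
API; the topic, paths and transform() are placeholders for our real ones:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaArchiveJob {
  // placeholder transformation standing in for our real processing
  def transform(value: String): String = value.toUpperCase

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-archive-transform")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Step 1: consume Kafka (zookeeper quorum, consumer group, topic -> threads)
    val kafkaStream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "archiver-group", Map("events" -> 1))
    val raw = kafkaStream.map(_._2)

    // Step 2: archive the raw data to HDFS...
    raw.saveAsTextFiles("hdfs:///archive/raw")

    // ...AND send it to step 3: transform
    val transformed = raw.map(transform)

    // Step 4: save transformed data to HDFS as input for M/R
    transformed.saveAsTextFiles("hdfs:///output/transformed")

    ssc.start()
    ssc.awaitTermination()
  }
}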

Yes, maybe spark-streaming isn't the right tool for our needs, but it looked
like it was, so I wonder whether I missed something.
To us it looks like a serious flaw if, in streaming mode, spark-streaming cannot
slow down its consumption depending on the available resources.

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Monday, August 11, 2014 11:44
To: Gwenhael Pasquiers
Subject: Re: [spark-streaming] kafka source and flow control

Hi,

On Mon, Aug 11, 2014 at 6:19 PM, gpasquiers
<gwenhael.pasqui...@ericsson.com> wrote:
I’m using spark-streaming in a cloudera environment to consume a kafka
source and store all data into hdfs.

I assume you are doing something else in between? If not, maybe software such
as Apache Flume would be better suited?

I have complete control over the Kafka consumer since I developed a custom
Receiver as a workaround to
https://issues.apache.org/jira/browse/SPARK-2103, but I'd like to implement
flow control more intelligent than a simple rate limit (x messages or bytes
per second).
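
For reference, the simple version looks roughly like this in my custom
Receiver; nextMessageFromKafka() is a stand-in for the real consumer call:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ThrottledKafkaReceiver(maxPerSecond: Long)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("throttled-kafka-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {} // consumer shutdown would go here

  private def receive(): Unit = {
    val intervalNanos = 1000000000L / maxPerSecond
    var next = System.nanoTime()
    while (!isStopped) {
      store(nextMessageFromKafka()) // hand one message to Spark
      // sleep off whatever remains of this message's time slot
      next += intervalNanos
      val sleepMs = (next - System.nanoTime()) / 1000000L
      if (sleepMs > 0) Thread.sleep(sleepMs)
    }
  }

  // stand-in for the real Kafka consumer call
  private def nextMessageFromKafka(): String = ???
}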

I’m interested in all ideas or suggestions.

I think what I would try to do is measure how much data I can process within
one time window (i.e., keep track of processing speed) and then (continuously?)
adapt the data rate to something that I am capable of processing. In that case
you would have to make sure that the data doesn't get lost within Kafka
instead. After all, the problem seems to be that your HDFS is too slow and
you'll have to buffer that data *somewhere*, right?
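
As a rough sketch of that idea, a StreamingListener could watch each batch's
processing delay and adjust a shared rate that your custom Receiver reads
before storing; the adjustment policy here is entirely made up:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class AdaptiveRateListener(currentRate: AtomicLong, batchMillis: Long)
    extends StreamingListener {

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val processingMs = batch.batchInfo.processingDelay.getOrElse(0L)
    val rate = currentRate.get()
    if (processingMs > batchMillis) {
      currentRate.set(math.max(rate / 2, 100L)) // falling behind: back off hard
    } else if (processingMs < batchMillis / 2) {
      currentRate.set(rate + rate / 10) // headroom: ramp up gently
    }
  }
}

// wired up on the context, with the receiver reading the same AtomicLong:
// ssc.addStreamingListener(new AdaptiveRateListener(sharedRate, 10000L))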

Tobias
