RE: [spark-streaming] kafka source and flow control

2014-08-12 Thread Gwenhael Pasquiers
I was hoping I could make the system behave like a blocking queue: if the
output is too slow, the buffers (storage space for RDDs) fill up and then block,
instead of dropping existing RDDs, until the input itself blocks (slows down
its consumption).
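
To make the idea concrete, here is a minimal sketch of what I mean, assuming
Spark's Receiver API (fetchNextMessage() is a placeholder for the real Kafka
read, and note that store() itself does not block on downstream processing,
which is the harder part of the problem):

import java.util.concurrent.ArrayBlockingQueue
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: a bounded buffer applies back-pressure between the
// consumer thread and the thread that hands data to Spark.
class BlockingKafkaReceiver(capacity: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

  private val buffer = new ArrayBlockingQueue[String](capacity)

  def onStart(): Unit = {
    // Consumer thread: put() blocks when the buffer is full, which
    // slows the Kafka read down to the speed of the drain thread.
    new Thread("kafka-consumer") {
      override def run(): Unit = {
        while (!isStopped) buffer.put(fetchNextMessage())
      }
    }.start()

    // Drain thread: hands buffered messages over to Spark.
    new Thread("buffer-drain") {
      override def run(): Unit = {
        while (!isStopped) store(buffer.take())
      }
    }.start()
  }

  def onStop(): Unit = {}

  // Placeholder for the actual Kafka consumer call.
  private def fetchNextMessage(): String = ???
}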

On a side note, I was wondering: is there the same issue with file (HDFS)
inputs? How can I be sure the input won’t “overflow” the processing chain?


From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Tuesday, August 12, 2014 02:58
To: Gwenhael Pasquiers
Cc: u...@spark.incubator.apache.org
Subject: Re: [spark-streaming] kafka source and flow control

Hi,

On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers 
gwenhael.pasqui...@ericsson.com wrote:
We intend to apply other operations on the data later in the same spark 
context, but our first step is to archive it.

Our goal is something like this:
Step 1 : consume kafka
Step 2 : archive to hdfs AND send to step 3
Step 3 : transform data
Step 4 : save transformed data to HDFS as input for M/R

I see. Well I think Spark Streaming may be well suited for that purpose.

To us it looks like a serious flaw if, in streaming mode, spark-streaming cannot
slow down its consumption depending on the available resources.

On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers 
gwenhael.pasqui...@ericsson.com wrote:
I think the kind of self-regulating system you describe would be too difficult
to implement and probably unreliable (even more so given that we have
multiple slaves).

Isn't "slow down its consumption depending on the available resources" a
self-regulating system? I don't see how you can adapt to available resources
without measuring your execution time and then changing how much you consume. Did
you have any particular form of adaptation in mind?

Tobias


RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
Hi,

We intend to apply other operations on the data later in the same spark 
context, but our first step is to archive it.

Our goal is something like this:
Step 1 : consume kafka
Step 2 : archive to hdfs AND send to step 3
Step 3 : transform data
Step 4 : save transformed data to HDFS as input for M/R
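
As a rough sketch of those four steps (topic name, paths and batch interval
are placeholders, and the stock KafkaUtils.createStream stands in here for
our custom receiver):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ArchivePipeline {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("archive"), Seconds(10))

    // Step 1: consume kafka.
    val raw = KafkaUtils.createStream(ssc, "zk:2181", "archive-group",
                                      Map("topic-in" -> 1)).map(_._2)

    // Step 2: archive to hdfs AND send to step 3.
    raw.saveAsTextFiles("hdfs:///archive/raw")

    // Step 3: transform data (placeholder transformation).
    val transformed = raw.map(_.trim)

    // Step 4: save transformed data to HDFS as input for M/R.
    transformed.saveAsTextFiles("hdfs:///archive/transformed")

    ssc.start()
    ssc.awaitTermination()
  }
}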

Yes, maybe spark-streaming isn’t the right tool for our needs, but it looked
like it was, so I wonder whether I missed something.
To us it looks like a serious flaw if, in streaming mode, spark-streaming cannot
slow down its consumption depending on the available resources.

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Monday, August 11, 2014 11:44
To: Gwenhael Pasquiers
Subject: Re: [spark-streaming] kafka source and flow control

Hi,

On Mon, Aug 11, 2014 at 6:19 PM, gpasquiers 
gwenhael.pasqui...@ericsson.com wrote:
I’m using spark-streaming in a cloudera environment to consume a kafka
source and store all data into hdfs.

I assume you are doing something else in between? If not, maybe software such
as Apache Flume would be better suited?

I have complete control over the kafka consumer since I developed a custom
Receiver as a workaround to
https://issues.apache.org/jira/browse/SPARK-2103, but I’d like to make flow
control more intelligent than a simple rate limit (x messages or bytes per
second).

I’m interested in all ideas or suggestions.

I think what I would try to do is to measure how much data I can process within
one time window (i.e., keep track of processing speed) and then (continuously?)
adapt the data rate to something that I am capable of processing. In that case
you would have to make sure that the data doesn't instead get lost within
Kafka. After all, the problem seems to be that your HDFS is too slow and you'll
have to buffer that data *somewhere*, right?
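
A very rough sketch of such a feedback loop, assuming a driver-side
StreamingListener and a shared per-batch cap; since the receiver runs on an
executor, actually delivering the updated cap to it would need an extra
channel (e.g. ZooKeeper), which I leave out here:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch only: shrink the consumption cap when a batch takes longer than
// the batch interval, and probe upward again while we keep up.
class AdaptiveRateListener(batchIntervalMs: Long, maxRatePerBatch: AtomicLong)
    extends StreamingListener {

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    batch.batchInfo.processingDelay.foreach { processingMs =>
      val current = maxRatePerBatch.get()
      if (processingMs > batchIntervalMs) {
        // Falling behind: consume less in the next batch.
        maxRatePerBatch.set(math.max(1L, (current * 0.8).toLong))
      } else {
        // Keeping up: slowly probe for a higher rate.
        maxRatePerBatch.set(current + math.max(1L, current / 10))
      }
    }
  }
}

// Registration on the driver; the custom receiver would consult the cap
// between Kafka fetches:
//   val cap = new AtomicLong(10000)
//   ssc.addStreamingListener(new AdaptiveRateListener(10000L, cap))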

Tobias


RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
I didn’t reply to the last part of your message:

My source is Kafka, and Kafka already acts as a buffer with a lot of space.

So when I start my spark job, there is a lot of data to catch up on (and it is
critical not to lose any), but the kafka consumer goes as fast as it can (and
that’s faster than my HDFS’s maximum throughput).
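
For completeness, the consumer properties for a lossless catch-up would look
roughly like this with the Kafka 0.8 high-level consumer (ZooKeeper address
and group id are placeholders):

import java.util.Properties

val props = new Properties()
props.put("zookeeper.connect", "zk:2181")
props.put("group.id", "archive-group")
// With no committed offset yet, start from the oldest retained message
// so a freshly started job catches up instead of skipping ahead.
props.put("auto.offset.reset", "smallest")
// Commit offsets only after data has reached HDFS, so a crash replays
// messages (at-least-once) rather than losing them.
props.put("auto.commit.enable", "false")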

I think the kind of self-regulating system you describe would be too difficult
to implement and probably unreliable (even more so given that we have
multiple slaves).



From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Monday, August 11, 2014 11:44
To: Gwenhael Pasquiers
Subject: Re: [spark-streaming] kafka source and flow control

Hi,

On Mon, Aug 11, 2014 at 6:19 PM, gpasquiers 
gwenhael.pasqui...@ericsson.com wrote:
I’m using spark-streaming in a cloudera environment to consume a kafka
source and store all data into hdfs.

I assume you are doing something else in between? If not, maybe software such
as Apache Flume would be better suited?

I have complete control over the kafka consumer since I developed a custom
Receiver as a workaround to
https://issues.apache.org/jira/browse/SPARK-2103, but I’d like to make flow
control more intelligent than a simple rate limit (x messages or bytes per
second).

I’m interested in all ideas or suggestions.

I think what I would try to do is to measure how much data I can process within
one time window (i.e., keep track of processing speed) and then (continuously?)
adapt the data rate to something that I am capable of processing. In that case
you would have to make sure that the data doesn't instead get lost within
Kafka. After all, the problem seems to be that your HDFS is too slow and you'll
have to buffer that data *somewhere*, right?

Tobias


Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Tobias Pfeiffer
Hi,

On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers 
gwenhael.pasqui...@ericsson.com wrote:

 We intend to apply other operations on the data later in the same spark
 context, but our first step is to archive it.



 Our goal is something like this:

 Step 1 : consume kafka
 Step 2 : archive to hdfs AND send to step 3
 Step 3 : transform data

 Step 4 : save transformed data to HDFS as input for M/R


I see. Well I think Spark Streaming may be well suited for that purpose.


 To us it looks like a serious flaw if, in streaming mode, spark-streaming
 cannot slow down its consumption depending on the available resources.


On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers 
gwenhael.pasqui...@ericsson.com wrote:

 I think the kind of self-regulating system you describe would be too
 difficult to implement and probably unreliable (even more so given that
 we have multiple slaves).


Isn't "slow down its consumption depending on the available resources" a
self-regulating system? I don't see how you can adapt to available
resources without measuring your execution time and then changing how much
you consume. Did you have any particular form of adaptation in mind?

Tobias


Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Xuri Nagarin
In general (and I am prototyping), I have a better idea :)
- Consume Kafka in Spark from topic-A
- Transform the data in Spark (normalize, enrich, etc.)
- Feed it back to Kafka (in a different topic, topic-B)
- Have Flume-HDFS (for M/R, Impala, Spark batch), Spark Streaming,
  or any other compute framework subscribe to B
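
As a rough sketch of that pipeline, assuming the Kafka 0.8 Scala producer
(broker list, topic names and the transformation are placeholders):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object EnrichPipeline {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("enrich"), Seconds(5))

    // Consume topic-A.
    val in = KafkaUtils.createStream(ssc, "zk:2181", "enrich-group",
                                     Map("topic-A" -> 1)).map(_._2)

    // Normalize / enrich (placeholder transformation).
    val enriched = in.map(_.trim)

    // Feed the result back to Kafka on topic-B, one producer per partition.
    enriched.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("metadata.broker.list", "broker:9092")
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))
        records.foreach(r => producer.send(new KeyedMessage[String, String]("topic-B", r)))
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}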




On Mon, Aug 11, 2014 at 5:57 PM, Tobias Pfeiffer t...@preferred.jp wrote:
 Hi,

 On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers
 gwenhael.pasqui...@ericsson.com wrote:

 We intend to apply other operations on the data later in the same spark
 context, but our first step is to archive it.



 Our goal is something like this:

 Step 1 : consume kafka
 Step 2 : archive to hdfs AND send to step 3
 Step 3 : transform data

 Step 4 : save transformed data to HDFS as input for M/R


 I see. Well I think Spark Streaming may be well suited for that purpose.


 To us it looks like a serious flaw if, in streaming mode, spark-streaming
 cannot slow down its consumption depending on the available resources.


 On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers
 gwenhael.pasqui...@ericsson.com wrote:

 I think the kind of self-regulating system you describe would be too
 difficult to implement and probably unreliable (even more so given that we
 have multiple slaves).


 Isn't "slow down its consumption depending on the available resources" a
 self-regulating system? I don't see how you can adapt to available
 resources without measuring your execution time and then changing how much you
 consume. Did you have any particular form of adaptation in mind?

 Tobias
