Hi,

I’m new to this mailing list as well as to Spark Streaming.

I’m using Spark Streaming in a Cloudera environment to consume a Kafka
source and store all the data into HDFS. The data volume is high, and our
issue is that the Kafka consumer outpaces HDFS: it fills up Spark
Streaming’s storage (memory), which then drops messages that have not yet
been saved to HDFS…

I’m far from an optimal environment performance-wise (Spark running in a
VM, accessing HDFS through an SSH tunnel), but I think it’s not a bad setup
for testing my system’s reliability.

I’ve been looking (unsuccessfully) for a way to slow down my Kafka consumer
based on the system’s health, e.g. by making the store() operation block
when there is no room left for storage.

I have complete control over the Kafka consumer since I developed a custom
Receiver as a workaround for
https://issues.apache.org/jira/browse/SPARK-2103, but I’d like flow control
that is more intelligent than a simple rate limit (x messages or bytes per
second).
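Concretely, the kind of backpressure I have in mind looks like the sketch
below: a bounded hand-off between the consumer thread and the thread that
calls store(). This is a plain-JVM illustration, not the Spark Receiver
API; the class and method names are just mine.

```scala
import java.util.concurrent.ArrayBlockingQueue

// A bounded buffer between the Kafka polling thread and the thread that
// hands messages to store(). put() blocks when the buffer is full, so the
// consumer naturally slows to the rate at which messages are drained.
// (Illustrative only: capacity and message type are placeholders.)
class BlockingBuffer[T](capacity: Int) {
  private val queue = new ArrayBlockingQueue[T](capacity)

  // Called from the Kafka consumer loop: blocks while the buffer is full.
  def offerBlocking(msg: T): Unit = queue.put(msg)

  // Called from the store() side: blocks while the buffer is empty.
  def take(): T = queue.take()

  def size: Int = queue.size
}
```

In a custom Receiver, the onStart() thread would drain this buffer and call
store(); whenever the storing side falls behind, the consumer’s
offerBlocking() blocks instead of letting messages pile up and get dropped.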

I’m interested in all ideas or suggestions.

B.R.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-source-and-flow-control-tp11879.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
