Hi, I’m new to this mailing list as well as spark-streaming.
I’m using Spark Streaming in a Cloudera environment to consume a Kafka source and store all the data into HDFS. The data volume is large, and our issue is that the Kafka consumer goes too fast for HDFS: it fills up Spark Streaming’s storage (memory) and causes messages that have not yet been saved to HDFS to be dropped.

I’m not at all in an optimal environment performance-wise (Spark running in a VM, accessing HDFS through an SSH tunnel), but I think it’s a reasonable setup for testing my system’s reliability.

I’ve been (unsuccessfully) looking for a way to slow down my Kafka consumer depending on the system’s health, perhaps by making the store() operation block when there is no room left in storage. I have complete control over the Kafka consumer, since I developed a custom Receiver as a workaround for https://issues.apache.org/jira/browse/SPARK-2103, but I’d like flow control more intelligent than a simple rate limit (x messages or bytes per second).

I’m interested in any ideas or suggestions.

B.R.
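To illustrate the kind of blocking behavior I have in mind (this is only a sketch of the idea, not Spark’s actual Receiver internals; the class and method names are hypothetical): a bounded buffer sits between the Kafka consumer thread and the HDFS writer, and the consumer’s store() blocks when the buffer is full instead of dropping messages.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of back-pressure between a consumer and a writer:
// a bounded queue whose put() blocks when full, pausing the consumer
// rather than letting unsaved messages be dropped.
public class BlockingBuffer {
    private final BlockingQueue<String> buffer;

    public BlockingBuffer(int capacity) {
        // Capacity bounds the in-memory backlog; choose it based on
        // available receiver memory.
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Called by the consumer thread: blocks until there is room.
    public void store(String message) throws InterruptedException {
        buffer.put(message);
    }

    // Called by the writer thread: blocks until a message is available.
    public String take() throws InterruptedException {
        return buffer.take();
    }

    public int size() {
        return buffer.size();
    }
}
```

With something like this, the consumer’s speed would automatically match the HDFS writer’s throughput, since put() only returns when the writer has drained some of the backlog. A smarter variant could watch the queue’s fill level and adjust the consumer’s poll rate gradually instead of stopping it outright.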