Yes, no one has reported this issue before. I just opened a JIRA on what I think is the main problem here: some of the receivers don't get restarted.
https://spark-project.atlassian.net/browse/SPARK-1340
I have a bunch of refactoring of the NetworkReceiver ready to be posted as a PR that should fix this.
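For a concrete picture of what restarting a receiver looks like, here is a minimal sketch of a custom receiver that asks its supervisor to relaunch it when its connection dies, instead of staying silently dead. The class and method names follow the Receiver API that this refactoring eventually turned into; treat it as an illustration, not the contents of the PR:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.{ConnectException, Socket}

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch: a line-oriented socket receiver that requests a restart on failure.
class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns immediately.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {} // the receiving thread exits once isStopped() is true

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand the record to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // Ask the supervisor to tear this receiver down and launch a new one.
      restart("Connection closed, restarting receiver")
    } catch {
      case e: ConnectException =>
        restart(s"Could not connect to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
```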
Regarding the second problem, I have been thinking about adding flow control (i.e., limiting the rate of receiving) for a while, just haven't gotten around to it. I added another JIRA to track this issue: https://spark-project.atlassian.net/browse/SPARK-1341

TD

On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <itparan...@gmail.com> wrote:

> On 28 Mar 2014, at 01:11, Scott Clasen <scott.cla...@gmail.com> wrote:
>
> > Evgeniy Shishkin wrote
> >> So, at the bottom -- kafka input stream just does not work.
> >
> > That was the conclusion I was coming to as well. Are there open tickets
> > around fixing this up?
>
> I am not aware of any. Actually, nobody complained about Spark+Kafka
> before, so I thought it just worked; then we tried to build something on
> it and almost failed.
>
> I think it is possible to replicate how Twitter Storm works with Kafka.
> It does manual partition assignment; at the least, this would help to
> balance the load.
>
> There is another issue: the StreamingContext creates new RDDs every batch
> interval, always, even if the previous computation has not finished.
>
> But with Kafka we could consume the later RDDs after we finish the
> previous ones. That would make it much simpler to avoid OOMs when
> starting from the beginning of a topic, because right now we can pull
> too much data from Kafka during one batch interval and then get an OOM.
>
> We simply cannot start slowly -- we cannot limit how much to consume
> during a batch.
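To make the flow-control idea in SPARK-1341 concrete: cap the rate at which a receiver hands records to Spark, so a large backlog in Kafka cannot flood a single batch. This sketch uses Guava's RateLimiter and an assumed maxRecordsPerSecond constructor parameter; neither is an existing Spark setting, it is just one way to implement the idea:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket

import com.google.common.util.concurrent.RateLimiter
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch: receiver-side flow control via a token-bucket rate limiter.
class ThrottledLineReceiver(host: String, port: Int, maxRecordsPerSecond: Double)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Throttled Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    // One permit per record; acquire() blocks once we exceed the budget,
    // which in turn slows down how fast we drain the upstream source.
    val limiter = RateLimiter.create(maxRecordsPerSecond)
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        limiter.acquire()
        store(line)
        line = reader.readLine()
      }
      reader.close()
      socket.close()
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}
```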
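On the Storm-style manual partition assignment point in the quoted thread: the high-level Kafka consumer behind KafkaInputDStream does not let you pin specific partitions to specific receivers, but the usual workaround for balancing load is to run one receiver per partition and union them into a single stream. A sketch, where the ZooKeeper quorum, group id, topic name, and partition count are all placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiverKafkaApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MultiReceiverKafkaApp")
    val ssc = new StreamingContext(conf, Seconds(2))

    val zkQuorum = "zk1:2181"         // placeholder
    val groupId = "my-consumer-group" // placeholder
    val topic = "events"              // placeholder
    val numPartitions = 8             // one receiver per Kafka partition

    // Each createStream() call launches its own receiver; the consumer
    // group's rebalancing spreads the topic's partitions across them.
    val streams = (1 to numPartitions).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(topic -> 1))
    }
    val unified = ssc.union(streams)

    unified.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```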