Yes, no one has reported this issue before. I just opened a JIRA on what I
think is the main problem here:
https://spark-project.atlassian.net/browse/SPARK-1340
Some of the receivers don't get restarted.
I have a bunch of refactoring in the NetworkReceiver ready to be posted as a
PR that should fix this.

Regarding the second problem, I have been thinking of adding flow control
(i.e. limiting the rate of receiving) for a while, just haven't gotten
around to it.
I added another JIRA to track this issue:
https://spark-project.atlassian.net/browse/SPARK-1341
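
To make the idea concrete, here is a minimal sketch of receiver-side flow
control, assuming Guava's RateLimiter; the ThrottledConsumer class and its
store callback are hypothetical stand-ins, not the actual NetworkReceiver
API:

    import com.google.common.util.concurrent.RateLimiter

    // Hypothetical wrapper around whatever call hands records to Spark
    // Streaming; `store` stands in for the receiver's block-pushing call.
    class ThrottledConsumer(maxRecordsPerSec: Double, store: String => Unit) {
      private val limiter = RateLimiter.create(maxRecordsPerSec)

      def onMessage(msg: String): Unit = {
        limiter.acquire()   // block until the rate budget allows one more record
        store(msg)          // hand the record to Spark Streaming
      }
    }

The real change would live inside the receiver itself, but the principle is
the same: back-pressure the consumer thread instead of buffering without
bound.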


TD


On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <itparan...@gmail.com> wrote:

>
> On 28 Mar 2014, at 01:11, Scott Clasen <scott.cla...@gmail.com> wrote:
>
> > Evgeniy Shishkin wrote
> >> So, at the bottom -- the Kafka input stream just does not work.
> >
> >
> > That was the conclusion I was coming to as well.  Are there open tickets
> > around fixing this up?
> >
>
> I am not aware of any. Actually, nobody has complained about Spark+Kafka
> before, so I assumed it just worked; then we tried to build something on it
> and almost failed.
>
> I think it is possible to steal/replicate how Twitter Storm works with
> Kafka.
> They do manual partition assignment; at the very least that would help
> balance the load.
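>
> A rough sketch of the closest workaround available today (not Storm-style
> manual partition assignment): spread the load over several receivers by
> calling KafkaUtils.createStream once per receiver and unioning the results.
> Here ssc is assumed to be an existing StreamingContext, and the ZooKeeper
> address, group id, topic name and stream count are placeholders:
>
>     import org.apache.spark.streaming.kafka.KafkaUtils
>
>     val numStreams = 4  // roughly one receiver per Kafka partition
>     val kafkaStreams = (1 to numStreams).map { _ =>
>       KafkaUtils.createStream(ssc, "zk:2181", "group", Map("topic" -> 1))
>     }
>     val unified = ssc.union(kafkaStreams)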
>
> There is another issue.
> The StreamingContext creates new RDDs every batch duration, always, even if
> the previous computation has not finished.
>
> But with Kafka we could consume the next chunk of data later, after we
> finish the previous RDDs.
> That would make it much simpler to avoid OOMs when starting from the
> beginning of a topic, because right now we pull a huge amount of data from
> Kafka during one batch duration and then run out of memory.
>
> But we just cannot start slowly; we cannot limit how much to consume per
> batch.
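>
> Purely as an illustration of the per-batch cap we are missing, a guard like
> the hypothetical one below could stop pulling from Kafka once a quota is
> hit and resume in the next interval (BatchQuota, maxPerBatch and resetQuota
> are made-up names, not an existing Spark or Kafka API):
>
>     // Hypothetical per-batch quota for the consumer thread.
>     class BatchQuota(maxPerBatch: Long) {
>       private var consumed = 0L
>       def tryConsume(): Boolean = synchronized {
>         if (consumed >= maxPerBatch) false
>         else { consumed += 1; true }
>       }
>       // Reset at each batch boundary so consumption resumes next interval.
>       def resetQuota(): Unit = synchronized { consumed = 0 }
>     }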
>
>
