Hi Is there any workaround to this problem?
I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of Kafka and handle the partition assignment manually. The easiest setup in this case would be to bind the number of parallel jobs to the number of partitions in Kafka. This is basically what Samza [2] does. I have a few questions regarding this implementation: The getReceiver method looks like a good starting point to implement the manual partition assignment. Unfortunately it lacks documentation. As far as I understood from the API, the getReceiver method is called once and passes the received object (implementing the NetworkReceiver contract) to the worker nodes. Therefore the logic to assign partitions to each receiver has to be implemented within the receiver itself. I'm planning to implement the following and have some questions in this regard: 1. within getReceiver: setup a zookeeper queue with the assignment of partitions to parallel jobs and store the number of consumers needed - Is the number of parallel jobs accessible through ssc? 2. within onStart: poll the zookeeper queue and receive the partition number(s) to receive data from 3. within onStart: start a new thread for keepalive messages to zookeeper. In case a node goes down and a new receiver is started up again, the new receiver can lookup zookeper to find the consumer without recent keepalives and take it's place. Due to SPARK-1340 this might not even be possible yet. Is there any insight on why the receiver is not started again? Do you think the use of zookeeper in this scenario is a good approach or is there an easier way using an integrated spark functionality? Also, by using the SimpleConsumer API one has to manage the offset per consumer. The high level consumer API solves this problem with zookeeper. I'd take the same approach, if there's no easier way to handle this in spark. Best Nicolas [1] https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example [2] http://samza.incubator.apache.org/learn/documentation/0.7.0/container/task-runner.html On Fri, Mar 28, 2014 at 2:36 PM, Evgeniy Shishkin <itparan...@gmail.com>wrote: > One more question, > > we are using memory_and_disk_ser_2 > and i worried when those rdds on disk will be removed > http://i.imgur.com/dbq5T6i.png > > unpersist is set to true, and rdds get purged from memory, but disk space > just keep growing. > > On 28 Mar 2014, at 01:32, Tathagata Das <tathagata.das1...@gmail.com> > wrote: > > > Yes, no one has reported this issue before. I just opened a JIRA on what > I think is the main problem here > > https://spark-project.atlassian.net/browse/SPARK-1340 > > Some of the receivers dont get restarted. > > I have a bunch refactoring in the NetworkReceiver ready to be posted as > a PR that should fix this. > > > > Regarding the second problem, I have been thinking of adding flow > control (i.e. limiting the rate of receiving) for a while, just havent > gotten around to it. > > I added another JIRA for that for tracking this issue. > > https://spark-project.atlassian.net/browse/SPARK-1341 > > > > > > TD > > > > > > On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <itparan...@gmail.com> > wrote: > > > > On 28 Mar 2014, at 01:11, Scott Clasen <scott.cla...@gmail.com> wrote: > > > > > Evgeniy Shishkin wrote > > >> So, at the bottom -- kafka input stream just does not work. > > > > > > > > > That was the conclusion I was coming to as well. Are there open > tickets > > > around fixing this up? > > > > > > > I am not aware of such. Actually nobody complained on spark+kafka before. > > So i thought it just works, and then we tried to build something on it > and almost failed. > > > > I think that it is possible to steal/replicate how twitter storm works > with kafka. > > They do manual partition assignment, at least this would help to balance > load. > > > > There is another issue. > > ssc batch creates new rdds every batch duration, always, even it > previous computation did not finish. > > > > But with kafka, we can consume more rdds later, after we finish previous > rdds. > > That way it would be much much simpler to not get OOM'ed when starting > from beginning, > > because we can consume many data from kafka during batch duration and > then get oom. > > > > But we just can not start slow, can not limit how many to consume during > batch. > > > > > > > > > > -- > > > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3379.html > > > Sent from the Apache Spark User List mailing list archive at > Nabble.com. > > > > > >