Re: KafkaInputDStream mapping of partitions to tasks

Aries Sun, 04 May 2014 19:04:26 -0700

Hi，
        Has anyone work on this?

Best
Aries


在 2014年3月30日，3:22，Nicolas Bär <[email protected]> 写道：

> Hi
> 
> Is there any workaround to this problem?
> 
> I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of 
> Kafka and handle the partition assignment manually. The easiest setup in this 
> case would be to bind the number of parallel jobs to the number of partitions 
> in Kafka. This is basically what Samza [2] does. I have a few questions 
> regarding this implementation:
> 
> The getReceiver method looks like a good starting point to implement the 
> manual partition assignment. Unfortunately it lacks documentation. As far as 
> I understood from the API, the getReceiver method is called once and passes 
> the received object (implementing the NetworkReceiver contract) to the worker 
> nodes. Therefore the logic to assign partitions to each receiver has to be 
> implemented within the receiver itself. I'm planning to implement the 
> following and have some questions in this regard:
> 1. within getReceiver: setup a zookeeper queue with the assignment of 
> partitions to parallel jobs and store the number of consumers needed
> - Is the number of parallel jobs accessible through ssc?
> 2. within onStart: poll the zookeeper queue and receive the partition 
> number(s) to receive data from
> 3. within onStart: start a new thread for keepalive messages to zookeeper. In 
> case a node goes down and a new receiver is started up again, the new 
> receiver can lookup zookeper to find the consumer without recent keepalives 
> and take it's place. Due to SPARK-1340 this might not even be possible yet. 
> Is there any insight on why the receiver is not started again?
> 
> Do you think the use of zookeeper in this scenario is a good approach or is 
> there an easier way using an integrated spark functionality? Also, by using 
> the SimpleConsumer API one has to manage the offset per consumer. The high 
> level consumer API solves this problem with zookeeper. I'd take the same 
> approach, if there's no easier way to handle this in spark.
> 
> 
> Best
> Nicolas
> 
> [1] 
> https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
> [2] 
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/task-runner.html
> 
> 
> On Fri, Mar 28, 2014 at 2:36 PM, Evgeniy Shishkin <[email protected]> 
> wrote:
> One more question,
> 
> we are using memory_and_disk_ser_2
> and i worried when those rdds on disk will be removed
> http://i.imgur.com/dbq5T6i.png
> 
> unpersist is set to true, and rdds get purged from memory, but disk space 
> just keep growing.
> 
> On 28 Mar 2014, at 01:32, Tathagata Das <[email protected]> wrote:
> 
> > Yes, no one has reported this issue before. I just opened a JIRA on what I 
> > think is the main problem here
> > https://spark-project.atlassian.net/browse/SPARK-1340
> > Some of the receivers dont get restarted.
> > I have a bunch refactoring in the NetworkReceiver ready to be posted as a 
> > PR that should fix this.
> >
> > Regarding the second problem, I have been thinking of adding flow control 
> > (i.e. limiting the rate of receiving) for a while, just havent gotten 
> > around to it.
> > I added another JIRA for that for tracking this issue.
> > https://spark-project.atlassian.net/browse/SPARK-1341
> >
> >
> > TD
> >
> >
> > On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <[email protected]> 
> > wrote:
> >
> > On 28 Mar 2014, at 01:11, Scott Clasen <[email protected]> wrote:
> >
> > > Evgeniy Shishkin wrote
> > >> So, at the bottom — kafka input stream just does not work.
> > >
> > >
> > > That was the conclusion I was coming to as well.  Are there open tickets
> > > around fixing this up?
> > >
> >
> > I am not aware of such. Actually nobody complained on spark+kafka before.
> > So i thought it just works, and then we tried to build something on it and 
> > almost failed.
> >
> > I think that it is possible to steal/replicate how twitter storm works with 
> > kafka.
> > They do manual partition assignment, at least this would help to balance 
> > load.
> >
> > There is another issue.
> > ssc batch creates new rdds every batch duration, always, even it previous 
> > computation did not finish.
> >
> > But with kafka, we can consume more rdds later, after we finish previous 
> > rdds.
> > That way it would be much much simpler to not get OOM’ed when starting from 
> > beginning,
> > because we can consume many data from kafka during batch duration and then 
> > get oom.
> >
> > But we just can not start slow, can not limit how many to consume during 
> > batch.
> >
> >
> > >
> > > --
> > > View this message in context: 
> > > http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3379.html
> > > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
> 
>

Re: KafkaInputDStream mapping of partitions to tasks

Reply via email to