Hi,
Has anyone worked on this?
Best
Aries
On Mar 30, 2014, at 3:22, Nicolas Bär <[email protected]> wrote:
> Hi
>
> Is there any workaround to this problem?
>
> I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of
> Kafka and handle the partition assignment manually. The easiest setup in this
> case would be to bind the number of parallel jobs to the number of partitions
> in Kafka. This is basically what Samza [2] does. I have a few questions
> regarding this implementation:
>
> The getReceiver method looks like a good starting point to implement the
> manual partition assignment. Unfortunately it lacks documentation. As far as
> I understood from the API, the getReceiver method is called once and passes
> the received object (implementing the NetworkReceiver contract) to the worker
> nodes. Therefore the logic to assign partitions to each receiver has to be
> implemented within the receiver itself. I'm planning to implement the
> following and have some questions in this regard:
> 1. within getReceiver: set up a ZooKeeper queue with the assignment of
> partitions to parallel jobs and store the number of consumers needed
> - Is the number of parallel jobs accessible through ssc?
> 2. within onStart: poll the ZooKeeper queue and receive the partition
> number(s) to receive data from
> 3. within onStart: start a new thread for keepalive messages to ZooKeeper. In
> case a node goes down and a new receiver is started up again, the new
> receiver can look up in ZooKeeper the consumer without recent keepalives
> and take its place. Due to SPARK-1340 this might not even be possible yet.
> Is there any insight on why the receiver is not started again?
>
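The three steps above can be sketched roughly as follows. This is only an illustration in plain Python, not Spark or Kafka code: `assign_partitions` and `SlotRegistry` are hypothetical names, and an in-memory dict stands in for the ZooKeeper nodes. It assumes partitions are spread round-robin over receiver slots and that a slot counts as free when its last keepalive is older than a timeout.

```python
from typing import Dict, List, Optional


def assign_partitions(num_partitions: int, num_receivers: int) -> Dict[int, List[int]]:
    """Spread Kafka partition ids round-robin over receiver slots (step 1)."""
    if num_receivers <= 0:
        raise ValueError("num_receivers must be positive")
    assignment: Dict[int, List[int]] = {slot: [] for slot in range(num_receivers)}
    for partition in range(num_partitions):
        assignment[partition % num_receivers].append(partition)
    return assignment


class SlotRegistry:
    """In-memory stand-in for the ZooKeeper state (hypothetical, not a real client)."""

    def __init__(self, assignment: Dict[int, List[int]], timeout_s: float) -> None:
        self.assignment = assignment
        self.timeout_s = timeout_s
        # None means the slot has never been claimed by any receiver.
        self.last_keepalive: Dict[int, Optional[float]] = {s: None for s in assignment}

    def claim(self, now: float) -> Optional[int]:
        """Take the first slot that is unclaimed or whose keepalive expired (step 2)."""
        for slot in sorted(self.assignment):
            seen = self.last_keepalive[slot]
            if seen is None or now - seen > self.timeout_s:
                self.last_keepalive[slot] = now
                return slot
        return None  # every slot has a live owner

    def keepalive(self, slot: int, now: float) -> None:
        """Record that the owner of `slot` is still alive (step 3)."""
        self.last_keepalive[slot] = now
```

For example, with 5 partitions and 2 receivers, slot 0 would own partitions 0, 2 and 4. A restarted receiver calling `claim` after the timeout would pick up whichever slot stopped sending keepalives. With a real ZooKeeper ensemble the claim would have to be atomic, and ephemeral nodes (which disappear when the owning session ends) could probably replace the explicit timestamps.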
> Do you think the use of ZooKeeper in this scenario is a good approach or is
> there an easier way using built-in Spark functionality? Also, by using
> the SimpleConsumer API one has to manage the offset per consumer. The high
> level consumer API solves this problem with ZooKeeper. I'd take the same
> approach, if there's no easier way to handle this in Spark.
>
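To make the offset bookkeeping concrete, here is a minimal sketch in plain Python. Nothing in it is a Kafka or Spark API: `OffsetStore` and `consume` are hypothetical names, a list stands in for one partition's log, and a dict stands in for wherever the offsets would be persisted (ZooKeeper in the proposal above). It assumes offsets start at 0 and that an offset is committed only after the messages have been handled, so a crash leads to re-reading rather than data loss.

```python
from typing import Callable, Dict, List, Tuple


class OffsetStore:
    """In-memory stand-in for a persistent offset store (hypothetical)."""

    def __init__(self) -> None:
        self._committed: Dict[Tuple[str, int], int] = {}

    def next_offset(self, topic: str, partition: int) -> int:
        """Offset of the next message to read; 0 if nothing was committed yet."""
        return self._committed.get((topic, partition), 0)

    def commit(self, topic: str, partition: int, next_offset: int) -> None:
        """Record progress for one partition; refuses to move backwards."""
        key = (topic, partition)
        if next_offset < self._committed.get(key, 0):
            raise ValueError("offsets must not move backwards")
        self._committed[key] = next_offset


def consume(log: List[str], store: OffsetStore, topic: str, partition: int,
            max_messages: int, handler: Callable[[str], None]) -> int:
    """Handle up to max_messages from the committed offset, then commit."""
    start = store.next_offset(topic, partition)
    batch = log[start:start + max_messages]
    for message in batch:
        handler(message)  # if this raises, the offset is not advanced
    store.commit(topic, partition, start + len(batch))
    return len(batch)
```

Each (topic, partition) pair keeps its own offset, which matches the one-receiver-per-partition layout described earlier in this message.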
>
> Best
> Nicolas
>
> [1]
> https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
> [2]
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/task-runner.html
>
>
> On Fri, Mar 28, 2014 at 2:36 PM, Evgeniy Shishkin <[email protected]>
> wrote:
> One more question,
>
> we are using MEMORY_AND_DISK_SER_2
> and I am worried about when those RDDs on disk will be removed
> http://i.imgur.com/dbq5T6i.png
>
> unpersist is set to true, and RDDs get purged from memory, but disk space
> just keeps growing.
>
> On 28 Mar 2014, at 01:32, Tathagata Das <[email protected]> wrote:
>
> > Yes, no one has reported this issue before. I just opened a JIRA on what I
> > think is the main problem here
> > https://spark-project.atlassian.net/browse/SPARK-1340
> > Some of the receivers don't get restarted.
> > I have a bunch of refactoring in the NetworkReceiver ready to be posted as a
> > PR that should fix this.
> >
> > Regarding the second problem, I have been thinking of adding flow control
> > (i.e. limiting the rate of receiving) for a while, just haven't gotten
> > around to it.
> > I added another JIRA to track this issue.
> > https://spark-project.atlassian.net/browse/SPARK-1341
> >
> >
> > TD
> >
> >
> > On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <[email protected]>
> > wrote:
> >
> > On 28 Mar 2014, at 01:11, Scott Clasen <[email protected]> wrote:
> >
> > > Evgeniy Shishkin wrote
> > >> So, the bottom line: the Kafka input stream just does not work.
> > >
> > >
> > > That was the conclusion I was coming to as well. Are there open tickets
> > > around fixing this up?
> > >
> >
> > I am not aware of any. Actually, nobody has complained about Spark + Kafka
> > before. So I thought it just worked, and then we tried to build something
> > on it and almost failed.
> >
> > I think it is possible to steal/replicate how Twitter Storm works with
> > Kafka.
> > They do manual partition assignment; at least this would help to balance
> > the load.
> >
> > There is another issue.
> > ssc creates new RDDs every batch duration, always, even if the previous
> > computation did not finish.
> >
> > But with Kafka, we could consume more RDDs later, after we finish the
> > previous RDDs.
> > That way it would be much simpler to avoid getting OOM’ed when starting
> > from the beginning,
> > because right now we can consume a lot of data from Kafka during one batch
> > duration and then get an OOM.
> >
> > But we just cannot start slow, and cannot limit how much to consume during
> > a batch.
> >
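The kind of limit asked for here could look roughly like the sketch below. It is plain Python and does not correspond to any existing Spark setting or API: `BatchBudget` and `drain` are hypothetical names, and the backlog is just a message count. The assumption is that each batch may take at most a fixed number of messages and that whatever is left simply stays in Kafka until a later batch.

```python
from typing import List


class BatchBudget:
    """Caps how many messages may be consumed within one batch (hypothetical)."""

    def __init__(self, max_per_batch: int) -> None:
        if max_per_batch <= 0:
            raise ValueError("max_per_batch must be positive")
        self.max_per_batch = max_per_batch
        self._left = max_per_batch

    def start_batch(self) -> None:
        """Reset the budget at the start of every batch duration."""
        self._left = self.max_per_batch

    def take(self, available: int) -> int:
        """Grant as many of the available messages as the budget still allows."""
        granted = min(available, self._left)
        self._left -= granted
        return granted


def drain(backlog: int, budget: BatchBudget) -> List[int]:
    """Show how a backlog is spread over consecutive batches under the cap."""
    sizes: List[int] = []
    while backlog > 0:
        budget.start_batch()
        granted = budget.take(backlog)
        sizes.append(granted)
        backlog -= granted
    return sizes
```

With a cap of 1000 messages per batch, a backlog of 2500 messages would be consumed as three batches of 1000, 1000 and 500, instead of one batch large enough to run out of memory.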
> >
> > >
> > > --
> > > View this message in context:
> > > http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3379.html
> >
> >
>
>