On 28 Mar 2014, at 00:34, Scott Clasen <scott.cla...@gmail.com> wrote:
Actually, looking closer, it is stranger than I thought:

in the Spark UI, one executor has executed 4 tasks, and the other has executed
1928...

Can anyone explain the workings of a KafkaInputDStream with respect to Kafka
partitions and the mapping to Spark executors and tasks?


Well, there are some issues with the Kafka input right now.
When you do KafkaUtils.createStream, it creates the Kafka high-level consumer on
one node!
I don't really know how many RDDs it will generate during a batch window,
but once those RDDs are created, Spark schedules the consecutive transformations
on that one node, because of locality.

You can try to repartition() those RDDs. Sometimes it helps.
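
A minimal sketch of what I mean (the ZooKeeper quorum, group id, topic, and
partition counts are all placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-repartition"), Seconds(10))

    // Single receiver: the high-level consumer lives on one executor.
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))

    // Shuffle the received blocks across the cluster so the downstream
    // transformations are not pinned to the receiver node.
    val spread = stream.repartition(8)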

To try to consume from Kafka on multiple machines, you can do
(1 to N).map(KafkaUtils.createStream).
But then an issue with the Kafka high-level consumer arises!
Those consumers operate in one consumer group, and they try to decide which
consumer consumes which partitions.
That sync partition rebalance may simply fail, and then only a few consumers
are really consuming.
To mitigate this problem, you can set the rebalance retries very high, and pray
it helps.
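
Roughly like this (the receiver count, retry values, and all names are made up;
rebalance.max.retries and rebalance.backoff.ms are standard Kafka 0.8
high-level consumer settings):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("multi-receiver"), Seconds(10))

    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk1:2181",
      "group.id"              -> "my-group",
      // crank up the rebalance retries and hope the sync partition
      // rebalance eventually succeeds
      "rebalance.max.retries" -> "20",
      "rebalance.backoff.ms"  -> "5000")

    val N = 4 // e.g. one receiver per Kafka partition
    val streams = (1 to N).map { _ =>
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER_2)
    }
    // Union the receivers into a single DStream for downstream processing.
    val unified = ssc.union(streams)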

Then there is yet another feature: if your receiver dies (OOM, hardware
failure), you just stop receiving from Kafka!
Brilliant.
And another feature: if you ask Spark's Kafka input to begin with
auto.offset.reset = smallest, it will reset your offsets every time you run the
application!
That does not comply with the documentation (reset to the earliest offsets only
if no offsets are found in ZooKeeper);
it just erases your offsets and starts again from zero!
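
For reference, the setting in question (values are placeholders):

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk1:2181",
      "group.id"          -> "my-group",
      // per the Kafka docs this should apply only when no offset is stored
      // in ZooKeeper, but Spark's KafkaInputDStream currently wipes the
      // stored offsets and starts from zero on every run
      "auto.offset.reset" -> "smallest")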

And remember that you should restart your streaming app whenever there is a
failure on the receiver!

So, bottom line: the Kafka input stream just does not work.


