On 28 Mar 2014, at 00:34, Scott Clasen <scott.cla...@gmail.com> wrote:
Actually, looking closer, it is stranger than I thought:

in the Spark UI, one executor has executed 4 tasks, and the other has executed
1928...

Can anyone explain the workings of a KafkaInputDStream with respect to Kafka
partitions and the mapping to Spark executors and tasks?


Well, there are some issues with the Kafka input right now.
When you do KafkaUtils.createStream, it creates the Kafka high-level consumer on
one node!
I don't really know how many RDDs it will generate during a batch window,
but once those RDDs are created, Spark schedules the consecutive transformations
on that one node, because of locality.

You can try to repartition() those RDDs. Sometimes it helps.
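
A minimal sketch of what I mean (the ZooKeeper quorum, group id, topic, and
partition counts are all placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-repartition"), Seconds(10))

    // Single receiver: the high-level consumer lives on one executor.
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))

    // Shuffle the received blocks across the cluster so the downstream
    // transformations are not pinned to the receiver node.
    val spread = stream.repartition(8)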

To try to consume from Kafka on multiple machines, you can do
(1 to N).map(KafkaUtils.createStream).
But then an issue with the Kafka high-level consumer arises!
Those consumers operate in one consumer group, and they try to decide which
consumer consumes which partitions.
That sync partition rebalance may simply fail, and then only a few consumers
are really consuming.
To mitigate this problem, you can set the rebalance retries very high, and pray
it helps.
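
Roughly like this (the receiver count, retry values, and all names are made up;
rebalance.max.retries and rebalance.backoff.ms are standard Kafka 0.8
high-level consumer settings):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("multi-receiver"), Seconds(10))

    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk1:2181",
      "group.id"              -> "my-group",
      // crank up the rebalance retries and hope the sync partition
      // rebalance eventually succeeds
      "rebalance.max.retries" -> "20",
      "rebalance.backoff.ms"  -> "5000")

    val N = 4 // e.g. one receiver per Kafka partition
    val streams = (1 to N).map { _ =>
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER_2)
    }
    // Union the receivers into a single DStream for downstream processing.
    val unified = ssc.union(streams)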

Then there is yet another feature: if your receiver dies (OOM, hardware
failure), you just stop receiving from Kafka!
Brilliant.
And another feature: if you ask Spark's Kafka input to begin with
auto.offset.reset = smallest, it will reset your offsets every time you run the
application!
That does not comply with the documentation (reset to the earliest offsets only
if no offsets are found in ZooKeeper);
it just erases your offsets and starts again from zero!
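
For reference, the setting in question (values are placeholders):

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk1:2181",
      "group.id"          -> "my-group",
      // per the Kafka docs this should apply only when no offset is stored
      // in ZooKeeper, but Spark's KafkaInputDStream currently wipes the
      // stored offsets and starts from zero on every run
      "auto.offset.reset" -> "smallest")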

And remember that you should restart your streaming app whenever there is a
failure on the receiver!

So, bottom line: the Kafka input stream just does not work.


