On 28 Mar 2014, at 00:34, Scott Clasen <scott.cla...@gmail.com> wrote:

> Actually looking closer it is stranger than I thought:
> in the Spark UI, one executor has executed 4 tasks, and one has executed 1928... Can anyone explain the workings of a KafkaInputStream wrt kafka partitions and mapping to spark executors and tasks?

Well, there are some issues with the Kafka input right now.

When you do KafkaUtils.createStream, it creates a Kafka high-level consumer on one node. I don't really know how many RDDs it will generate during a batch window, but once those RDDs are created, Spark schedules the consecutive transformations on that same node because of locality. You can try to repartition() those RDDs; sometimes it helps (there is a small sketch at the end of this mail).

To try to consume from Kafka on multiple machines you can do (1 to N).map(KafkaUtils.createStream). But then an issue with the Kafka high-level consumer arises: those consumers operate in one consumer group and have to agree among themselves on which consumer consumes which partitions. That partition rebalance can simply fail to synchronize, and then only a few of the consumers are really consuming. To mitigate this you can set the rebalance retries very high and pray it helps (second sketch below).

Then there is yet another "feature": if your receiver dies (OOM, hardware failure), you just stop receiving from Kafka. Brilliant. And another one: if you ask Spark's Kafka input to start with auto.offset.reset = smallest, it will reset your offsets every time you run the application. It does not comply with the documentation (reset to the earliest offsets only if no offsets are found in ZooKeeper); it simply erases your offsets and starts again from zero. And remember that you have to restart your streaming app whenever a receiver fails.

So, at the bottom: the Kafka input stream just does not work.
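Here is a minimal sketch of the repartition() workaround, assuming the Spark 0.9.x streaming API and the spark-streaming-kafka module; the master URL, ZooKeeper address, topic, group id and partition counts are placeholders, not recommendations:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One StreamingContext with a 10-second batch window (placeholder values).
    val ssc = new StreamingContext("spark://master:7077", "kafka-repartition", Seconds(10))

    // createStream starts a single high-level consumer inside one receiver,
    // so all received blocks land on the executor hosting that receiver.
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 2))

    // Shuffle the received data across the cluster before the heavy work;
    // without this, locality keeps the downstream tasks on the receiver's node.
    stream.repartition(16).map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()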
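And a sketch of the multi-receiver variant with the rebalance retries cranked up, under the same assumptions; the consumer properties and the number of receivers are just examples:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext("spark://master:7077", "kafka-multi-receiver", Seconds(10))

    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk1:2181",
      "group.id"              -> "my-group",
      // Give the high-level consumers many attempts to agree on partition ownership.
      "rebalance.max.retries" -> "20",
      "rebalance.backoff.ms"  -> "5000"
      // Careful with "auto.offset.reset" -> "smallest": as described above,
      // the receiver wipes the stored offsets on every start.
    )

    // One receiver (one high-level consumer) per stream; with luck they get
    // scheduled on different executors.
    val streams = (1 to 4).map { _ =>
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 2), StorageLevel.MEMORY_AND_DISK_SER_2)
    }

    // Union them back into a single DStream for the rest of the job.
    val unioned = ssc.union(streams)
    unioned.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()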