Hi,

I have a number of questions about using the Kafka receiver of Spark
Streaming. Maybe someone has more experience with it and can help me
out.

I have set up an environment for getting to know Spark, consisting of
- a Mesos cluster with 3 slave-only nodes and 3 combined
  master-and-slave nodes,
- 2 Kafka nodes,
- 3 Zookeeper nodes providing service to both Kafka and Mesos.

My Kafka cluster has only one topic with one partition (replicated to
both nodes). When I start my Kafka receiver, it successfully connects
to Kafka and does the processing, but it seems as if the (expensive)
function in the final foreachRDD(...) is executed on only one node of
my cluster, which is not what I had in mind when setting up the
cluster ;-)
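
For reference, this is roughly what my code looks like (spark-shell
style; hostnames, group id, topic name, batch interval and
expensiveFunction are of course just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // stand-in for the real (expensive) processing
    def expensiveFunction(msg: String): Unit = { /* ... */ }

    val conf = new SparkConf().setAppName("KafkaReceiverTest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // one receiver; the Map requests one consumer thread for "my-topic"
    val stream = KafkaUtils.createStream(ssc,
      "zk1:2181,zk2:2181,zk3:2181",   // Zookeeper quorum
      "my-consumer-group",            // consumer group id
      Map("my-topic" -> 1),           // topic -> number of consumer threads
      StorageLevel.MEMORY_AND_DISK_SER_2)

    stream.foreachRDD { rdd =>
      // this is the part that only ever seems to run on a single node
      rdd.foreach { case (_, msg) => expensiveFunction(msg) }
    }

    ssc.start()
    ssc.awaitTermination()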

So first, I was wondering about the parameter `topics: Map[String,
Int]` to KafkaUtils.createStream(). Apparently it controls how many
connections are made from my cluster nodes to Kafka. The Kafka doc at
https://kafka.apache.org/documentation.html#introduction says "each
message published to a topic is delivered to one consumer instance
within each subscribing consumer group" and "If all the consumer
instances have the same consumer group, then this works just like a
traditional queue balancing load over the consumers."

The Kafka docs *also* say: "Note however that there cannot be more
consumer instances than partitions." This seems to imply that with
only one partition, increasing the number in my Map should have no
effect.

However, if I increase the number of streams for my one topic in my
`topics` Map, I actually *do* see that the task in my foreachRDD(...)
call is now executed on multiple nodes. Maybe it's more of a Kafka
question than a Spark one, but can anyone explain this to me? Should I
always have more Kafka partitions than Mesos cluster nodes?
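
Concretely, the only thing I change is the number in the topics Map
(again with placeholder names), for example:

    // three consumer threads for my one-partition topic
    val stream = KafkaUtils.createStream(ssc,
      "zk1:2181,zk2:2181,zk3:2181", "my-consumer-group",
      Map("my-topic" -> 3))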

So, assuming that changing the number in that Map is not what I want
(although I don't know if it is), I tried using
.repartition(numOfClusterNodes), which doesn't seem right if I want
to add and remove Mesos nodes on demand. This *also* spread the
foreachRDD(...) action evenly across the nodes; however, the function
never seems to terminate, so I never get to process the next interval
in the stream. A similar behavior can be observed when running
locally instead of on the cluster: the program does not exit, but
hangs after everything else has shut down. Any hints concerning this
issue?
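
For completeness, the repartition variant looks roughly like this
(continuing the sketch from above, with the partition count
hard-coded, which is exactly what bothers me):

    stream
      .repartition(6)   // numOfClusterNodes, hard-coded for now
      .foreachRDD { rdd =>
        rdd.foreach { case (_, msg) => expensiveFunction(msg) }
      }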

Thanks
Tobias
