Hey,

Spark 1.1.0, Kafka 0.8.1.1, Hadoop (YARN/HDFS) 2.5.1
I have a five-partition Kafka topic. I can create a single Kafka receiver via KafkaUtils.createStream with five threads in the topic map and consume messages fine. Sifting through the user list and Google, I see that it's possible to split the Kafka receivers among the Spark workers, so that I can have a receiver per partition and have these distributed to workers rather than localized to the driver. I'm following something like this (but for Kafka, obviously): https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java#L132

From the Streaming Programming Guide: "Receiving multiple data streams can therefore be achieved by creating multiple input DStreams and configuring them to receive different partitions of the data stream from the source(s)."

However, I'm not able to consume any messages from Kafka after I perform the union operation. Again, if I create a single, multi-threaded receiver, I can consume messages fine. If I create 5 receivers in a loop and call jssc.union(…), I get:

INFO scheduler.ReceiverTracker: Stream 0 received 0 blocks
INFO scheduler.ReceiverTracker: Stream 1 received 0 blocks
INFO scheduler.ReceiverTracker: Stream 2 received 0 blocks
INFO scheduler.ReceiverTracker: Stream 3 received 0 blocks
INFO scheduler.ReceiverTracker: Stream 4 received 0 blocks

Do I need to do anything else with the unioned DStream? Am I going about this incorrectly?

Thanks in advance,
Matt
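P.S. For reference, here's roughly what my receiver loop looks like (a trimmed sketch, not my exact job -- the ZooKeeper quorum, group id, and topic name are placeholders):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class MultiReceiverKafka {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("multi-receiver-kafka");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(2000));

    String zkQuorum = "zkhost:2181";  // placeholder
    String groupId = "my-group";      // placeholder
    int numReceivers = 5;             // one receiver per topic partition

    // Create one single-threaded Kafka receiver per partition
    List<JavaPairDStream<String, String>> streams =
        new ArrayList<JavaPairDStream<String, String>>();
    for (int i = 0; i < numReceivers; i++) {
      Map<String, Integer> topicMap = new HashMap<String, Integer>();
      topicMap.put("my-topic", 1);    // one consumer thread in this receiver
      JavaPairReceiverInputDStream<String, String> stream =
          KafkaUtils.createStream(jssc, zkQuorum, groupId, topicMap);
      streams.add(stream);
    }

    // Union the five DStreams into one before further processing
    JavaPairDStream<String, String> unioned =
        jssc.union(streams.get(0), streams.subList(1, streams.size()));

    unioned.print();
    jssc.start();
    jssc.awaitTermination();
  }
}
```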