Hi,

We have been using Spark Kafka streaming for real-time processing with success. The scale of this stream has grown along with our data, and so far we have been able to scale up by adding more brokers to the Kafka cluster, more partitions to the topic, and more executors to the Spark streaming app.
At this time our biggest topic has about 750 partitions, and in every micro-batch the driver fetches the metadata for this topic from Kafka and schedules the tasks. Will this step become a bottleneck if we continue to scale in this way? Are there any best practices for scaling up a streaming job?

Thanks,
Charles