Stephane Maarek created SPARK-20287: ---------------------------------------
Summary: Kafka Consumer should be able to subscribe to more than one topic partition Key: SPARK-20287 URL: https://issues.apache.org/jira/browse/SPARK-20287 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Stephane Maarek As I understand and as it stands, one Kafka Consumer is created for each topic partition in the source Kafka topics, and they're cached. cf https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48 In my opinion, that makes the design an anti pattern for Kafka and highly unefficient: - Each Kafka consumer creates a connection to Kafka - Spark doesn't leverage the power of the Kafka consumers, which is that it automatically assigns and balances partitions amongst all the consumers that share the same group.id - You can still cache your Kafka consumer even if it has multiple partitions. I'm not sure about how that translates to the spark underlying RDD architecture, but from a Kafka standpoint, I believe creating one consumer per partition is a big overhead, and a risk as the user may have to increase the spark.streaming.kafka.consumer.cache.maxCapacity parameter. Happy to discuss to understand the rationale -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org