[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cody Koeninger updated SPARK-20287:
-----------------------------------

What you're describing is closer to the receiver-based implementation, which had a number of issues. What I tried to achieve with the direct stream implementation was to have the driver figure out offset ranges for the next batch, then have executors deterministically consume exactly those messages, with a 1:1 mapping between Kafka partition and Spark partition.

If you have a single consumer subscribed to multiple topicpartitions, you'll get intermingled messages for all of those partitions. With the new consumer API subscribed to multiple partitions, there isn't a way to say "get topicpartition A until offset 1234", which is what we need.

-- Cody Koeninger

> Kafka Consumer should be able to subscribe to more than one topic partition
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-20287
>                 URL: https://issues.apache.org/jira/browse/SPARK-20287
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Stephane Maarek
>
> As I understand it, and as it stands, one Kafka consumer is created for each topic partition in the source Kafka topics, and these consumers are cached.
> cf https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of Kafka consumers, which is that they automatically assign and balance partitions amongst all the consumers that share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure how that translates to Spark's underlying RDD architecture, but from a Kafka standpoint, I believe creating one consumer per partition is a big overhead, and a risk, as the user may have to increase the spark.streaming.kafka.consumer.cache.maxCapacity parameter.
> Happy to discuss to understand the rationale.
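For illustration, here is a minimal sketch (not the actual CachedKafkaConsumer code; the object name, the fetchRange helper, and the 512 ms poll timeout are made up) of the assign-and-seek pattern the direct stream relies on to read exactly one offset range from one TopicPartition. With subscribe() and a group.id, poll() interleaves records from whatever partitions the broker happens to assign, and there is no call that bounds a single partition at a given end offset.

{code:scala}
import java.util.Collections

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition

object FetchRangeSketch {

  /** Read exactly the records in [fromOffset, untilOffset) from a single partition. */
  def fetchRange(
      consumer: KafkaConsumer[Array[Byte], Array[Byte]],
      tp: TopicPartition,
      fromOffset: Long,
      untilOffset: Long): Seq[ConsumerRecord[Array[Byte], Array[Byte]]] = {

    // assign() pins this consumer to exactly one partition, bypassing group
    // management, and seek() positions it at the start of the batch's range.
    consumer.assign(Collections.singletonList(tp))
    consumer.seek(tp, fromOffset)

    val buffer = ArrayBuffer.empty[ConsumerRecord[Array[Byte], Array[Byte]]]
    var nextOffset = fromOffset
    while (nextOffset < untilOffset) {
      // poll() returns however many records are ready; keep only those inside
      // the requested range so every task sees the same deterministic batch.
      // (The real code also handles timeouts and missing offsets; omitted here.)
      val records = consumer.poll(512).records(tp).asScala
      for (r <- records) {
        if (r.offset() < untilOffset) buffer += r
        nextOffset = r.offset() + 1
      }
    }
    buffer
  }
}
{code}

This is the reason each executor keeps one assigned consumer per TopicPartition (and caches it) instead of one subscribed consumer per group: the driver's offset ranges can only be honored exactly when the consumer is pinned to a single partition.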