[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963939#comment-15963939
 ] 

Stephane Maarek commented on SPARK-20287:
-----------------------------------------

The other issue I can see is the coordinator work needed to re-coordinate XX 
Kafka consumers should one go down. That is more expensive with 100 consumers 
than with a few. But as you said, any change should be driven by a measured 
performance limitation; right now that would be speculation.

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-20287
>                 URL: https://issues.apache.org/jira/browse/SPARK-20287
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Stephane Maarek
>
> As I understand it, and as it stands, one Kafka consumer is created for each 
> topic partition in the source Kafka topics, and these consumers are cached.
> cf. 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka.
> - Spark doesn't leverage the power of Kafka consumers, which is that Kafka 
> automatically assigns and balances partitions among all the consumers that 
> share the same group.id.
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure how that translates to Spark's underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk, as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss this to understand the rationale.
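
To make the trade-off in the description above concrete, here is a minimal
sketch in Python that only models the bookkeeping; it does not use the Kafka
client or Spark. The function names (one_consumer_per_partition,
round_robin_assignment), the topic names ("events", "clicks"), and the consumer
count of 3 are all hypothetical, chosen for illustration. The round-robin
split is an assumption standing in for whatever assignment strategy a real
consumer group would use.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A topic partition is modelled as (topic name, partition number).
TopicPartition = Tuple[str, int]


def one_consumer_per_partition(
    partitions: List[TopicPartition],
) -> Dict[int, List[TopicPartition]]:
    """Model of the design described in the issue: one consumer per partition."""
    return {consumer_id: [tp] for consumer_id, tp in enumerate(partitions)}


def round_robin_assignment(
    partitions: List[TopicPartition], num_consumers: int
) -> Dict[int, List[TopicPartition]]:
    """Spread partitions over a fixed number of consumers, round-robin.

    This is an illustrative stand-in, not Kafka's actual assignor code.
    """
    if num_consumers < 1:
        raise ValueError("num_consumers must be at least 1")
    assignment: Dict[int, List[TopicPartition]] = defaultdict(list)
    for index, tp in enumerate(sorted(partitions)):
        assignment[index % num_consumers].append(tp)
    return dict(assignment)


# Hypothetical workload: two topics with 6 and 4 partitions.
partitions = [("events", p) for p in range(6)] + [("clicks", p) for p in range(4)]

per_partition = one_consumer_per_partition(partitions)
shared = round_robin_assignment(partitions, num_consumers=3)

# Each consumer implies at least one connection and one cache slot.
print(len(per_partition))  # 10 consumers
print(len(shared))         # 3 consumers
```

In this model the number of consumers, and so the number of cache entries,
grows with the partition count in the first case and stays fixed in the
second. Whether that saving matters in practice is the open question of this
thread.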



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

