[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963938#comment-15963938 ]

Stephane Maarek commented on SPARK-20287:
-----------------------------------------

[~srowen] those are good points. In the case of 100 tasks running on 100 
separate machines, I agree you end up with 100 Kafka consumers no matter what. 
As you said, the optimisation I have in mind would come when tasks on the same 
machine could share a Kafka consumer. 
My concern, as you said, is the number of connections opened to Kafka, which 
might be high even when it doesn't need to be. I understand that one Kafka 
consumer distributing records to multiple tasks may bind them together on the 
receive side, and I'm not a Spark expert, so I can't measure the performance 
implications of that. 
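
To make the sharing idea concrete, here is a minimal sketch using the plain 
kafka-clients consumer API (not Spark's internals); the topic name, partition 
numbers and broker address are made up for illustration. A single consumer 
instance can be assigned several partitions and serve them all over one set of 
broker connections:

    import java.util.Properties
    import scala.collection.JavaConverters._

    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    object MultiPartitionConsumerSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address
        props.put("key.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")
        props.put("value.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")

        // One consumer explicitly assigned several partitions of a
        // hypothetical "events" topic: one client, many partitions.
        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
        consumer.assign(Seq(
          new TopicPartition("events", 0),
          new TopicPartition("events", 1),
          new TopicPartition("events", 2)
        ).asJava)

        // poll() returns records interleaved from all assigned partitions
        val records = consumer.poll(1000L)
        for (r <- records.asScala) {
          println(s"${r.topic()}-${r.partition()} offset=${r.offset()}")
        }
        consumer.close()
      }
    }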

My concern then is with the spark.streaming.kafka.consumer.cache.maxCapacity 
parameter. Is it truly needed?
Say one executor consumes from 100 partitions: do we really need a maxCapacity 
parameter, or should the executor just spin up as many consumers as it needs?
Likewise, in a distributed context, can't the individual executors figure out 
how many Kafka consumers they need? 
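
For context, the only lever today is that cache-capacity setting on the Spark 
side; a minimal sketch of how it gets raised (the value 128 is an arbitrary 
figure for illustration, not a recommendation):

    import org.apache.spark.SparkConf

    // Per-executor cap on cached Kafka consumers; the value here is an
    // assumed figure for illustration only.
    val conf = new SparkConf()
      .setAppName("kafka-consumer-cache-sizing")
      .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")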

Thanks for the discussion, I appreciate it.

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-20287
>                 URL: https://issues.apache.org/jira/browse/SPARK-20287
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Stephane Maarek
>
> As I understand it, and as it stands, one Kafka consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumer group protocol, 
> which automatically assigns and balances partitions amongst all the consumers 
> that share the same group.id (see the subscribe sketch after this quoted 
> description)
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure how that translates to Spark's underlying RDD architecture, but 
> from a Kafka standpoint, I believe creating one consumer per partition is a 
> big overhead, and a risk, as the user may have to increase the 
> spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss this to understand the rationale.
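
(Re the second bullet in the quoted description: a minimal sketch of the 
group-managed path with the plain kafka-clients API, for comparison with the 
explicit assign() sketch earlier; the topic name, group.id and broker address 
are assumptions for illustration. With subscribe(), the group coordinator 
spreads the topic's partitions across all consumers sharing the group.id and 
rebalances them automatically when consumers join or leave.)

    import java.util.Properties
    import scala.collection.JavaConverters._

    import org.apache.kafka.clients.consumer.KafkaConsumer

    object GroupManagedConsumerSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")    // assumed broker address
        props.put("group.id", "spark-structured-streaming") // assumed group id
        props.put("key.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")
        props.put("value.deserializer",
          "org.apache.kafka.common.serialization.ByteArrayDeserializer")

        // subscribe() (rather than assign()) lets the broker-side group
        // coordinator distribute the topic's partitions across every consumer
        // in this group.id and rebalance on membership changes.
        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
        consumer.subscribe(Seq("events").asJava)

        val records = consumer.poll(1000L)
        for (r <- records.asScala) {
          println(s"partition=${r.partition()} offset=${r.offset()}")
        }
        consumer.close()
      }
    }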



