[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963920#comment-15963920
 ] 

Sean Owen commented on SPARK-20287:
-----------------------------------

Spark has a different execution model, though: it wants to distribute the 
processing of partitions into logically separate (and sometimes physically 
separate) tasks. It makes sense to consume one Kafka partition as one Spark 
partition. If you have 100 workers consuming 100 partitions across 100 
different machines, there's no way to share consumers among them, right?

There might be some scope to use a single consumer to read n Kafka partitions 
on behalf of n Spark tasks when they happen to land on one executor. Does that 
solve a real problem, though? You say you think the current approach might be a 
big overhead, but is it? The overhead amounts to more connections than are 
strictly needed, and I could see that becoming a problem at thousands of tasks.
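For concreteness, a rough sketch of what executor-local sharing could look like 
(the class and its API are hypothetical, not existing Spark code): one consumer 
assigned every partition that landed on the executor, with per-partition reads 
funnelled through it. Since KafkaConsumer is not thread-safe, every access has 
to be serialized, which is exactly the coupling between tasks I mean below.

    import java.{util => ju}
    import scala.collection.JavaConverters._
    import scala.collection.mutable
    import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition

    // Hypothetical sketch: one consumer per (group.id, topic) shared by all
    // tasks on an executor, instead of one cached consumer per partition.
    class SharedKafkaConsumer[K, V](kafkaParams: ju.Map[String, Object],
                                    partitions: Seq[TopicPartition]) {

      private val consumer = new KafkaConsumer[K, V](kafkaParams)
      consumer.assign(partitions.asJava)

      // Records buffered per partition until the owning task asks for them.
      private val buffers =
        mutable.Map.empty[TopicPartition, mutable.Queue[ConsumerRecord[K, V]]]
      private def bufferFor(tp: TopicPartition) =
        buffers.getOrElseUpdate(tp, mutable.Queue.empty)

      // All reads are serialized on this object: the shared bottleneck that
      // now binds the executor's tasks together.
      def next(tp: TopicPartition, timeoutMs: Long): Option[ConsumerRecord[K, V]] =
        synchronized {
          if (bufferFor(tp).isEmpty) {
            consumer.poll(timeoutMs).asScala.foreach { r =>
              bufferFor(new TopicPartition(r.topic, r.partition)).enqueue(r)
            }
          }
          val q = bufferFor(tp)
          if (q.nonEmpty) Some(q.dequeue()) else None
        }

      def close(): Unit = synchronized(consumer.close())
    }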

The flip side is that sharing has its own complexity and, I presume, 
bottlenecks that now bind tasks together. That could be problematic, but I 
haven't thought through the details.

I think you'd have to make more of a case that this is a problem, and then 
propose a solution.

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-20287
>                 URL: https://issues.apache.org/jira/browse/SPARK-20287
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Stephane Maarek
>
> As I understand it, and as it stands, one Kafka consumer is created for each 
> topic partition in the source Kafka topics, and those consumers are cached.
> cf. 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates its own connection to Kafka
> - Spark doesn't leverage the power of Kafka consumers, which automatically 
> assign and balance partitions amongst all the consumers that share the same 
> group.id
> - You can still cache your Kafka consumer even if it has multiple partitions 
> (see the sketch below).
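> For illustration, a minimal plain-Kafka sketch (broker address, group.id, and 
> topic name are placeholders, not anything taken from Spark) of one consumer 
> assigned several partitions of a topic, which could still sit behind a cache 
> with a single entry rather than one entry per partition:
>
>     import java.util.Properties
>     import scala.collection.JavaConverters._
>     import org.apache.kafka.clients.consumer.KafkaConsumer
>     import org.apache.kafka.common.TopicPartition
>
>     val props = new Properties()
>     props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
>     props.put("group.id", "example-group")            // placeholder group
>     props.put("key.deserializer",
>       "org.apache.kafka.common.serialization.ByteArrayDeserializer")
>     props.put("value.deserializer",
>       "org.apache.kafka.common.serialization.ByteArrayDeserializer")
>
>     val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
>     // One consumer (one connection) assigned partitions 0..2 of "events",
>     // instead of three separate consumers.
>     consumer.assign(Seq(0, 1, 2).map(new TopicPartition("events", _)).asJava)
>     val records = consumer.poll(512L)
>     records.partitions.asScala.foreach { tp =>
>       println(s"$tp -> ${records.records(tp).size} records")
>     }
>     consumer.close()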
> I'm not sure how that translates to Spark's underlying RDD architecture, but 
> from a Kafka standpoint, I believe creating one consumer per partition is a 
> big overhead, and a risk, as the user may have to increase the 
> spark.streaming.kafka.consumer.cache.maxCapacity parameter.
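> For example (the value and variable name are illustrative only), that would 
> mean bumping the setting via SparkConf:
>
>     // Raising the cached-consumer capacity per executor above its default:
>     val conf = new org.apache.spark.SparkConf()
>       .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")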
> Happy to discuss and understand the rationale.


