[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966973#comment-15966973 ]

Stephane Maarek commented on SPARK-20287:
-----------------------------------------

[~c...@koeninger.org] 
How about using the subscribe pattern?
https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

```
public void subscribe(Collection<String> topics)
Subscribe to the given list of topics to get dynamically assigned partitions. 
Topic subscriptions are not incremental. This list will replace the current 
assignment (if there is one). It is not possible to combine topic subscription 
with group management with manual partition assignment through 
assign(Collection). If the given list of topics is empty, it is treated the 
same as unsubscribe().
```

Then you would let Kafka handle the partition assignments? Since all the consumers 
share the same group.id, the data would be effectively distributed across every 
Spark instance?
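
For illustration, here is a minimal sketch of what subscribe-based group management 
looks like with the plain Kafka client (not Spark code; the broker address, topic 
name and group.id are made up):

```
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object SubscribeSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")         // assumption: broker address
    props.put("group.id", "spark-structured-streaming-app")  // every instance shares this group.id
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    // subscribe() delegates partition assignment to Kafka's group coordinator:
    // the partitions of "events" get balanced across all consumers in the group
    consumer.subscribe(Collections.singletonList("events"))  // assumption: topic name

    try {
      while (true) {
        val records = consumer.poll(500L) // 0.10.x-style poll(timeoutMs)
        for (record <- records.asScala) {
          println(s"${record.topic()}-${record.partition()} @ ${record.offset()}: ${record.value()}")
        }
      }
    } finally {
      consumer.close()
    }
  }
}
```

If one of these consumers dies, the group coordinator rebalances its partitions onto 
the survivors without any bookkeeping on the application side.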

But then I guess you may have already explored that option and it goes against 
the Spark DirectStream API? (I'm not a Spark expert, just trying to understand the 
limitations. I believe you when you say you did it the most straightforward way.)
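
For contrast, this is my rough understanding of what the current cached 
consumer-per-partition approach amounts to, again sketched with the plain Kafka 
client (the cache key, config values and names are illustrative, not Spark's 
actual internals):

```
import java.util.{Collections, Properties}
import scala.collection.mutable
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

// Illustrative only: a cache keyed by TopicPartition where each entry holds a
// consumer that is assign()-ed exactly one partition, so there is no group
// rebalancing, but also one broker connection per partition.
object PerPartitionConsumerCache {
  private val cache = mutable.Map.empty[TopicPartition, KafkaConsumer[String, String]]

  private def newConsumer(): KafkaConsumer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // assumption: broker address
    props.put("group.id", "per-partition-cache")      // set, but group management is never used
    props.put("enable.auto.commit", "false")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)
    new KafkaConsumer[String, String](props)
  }

  def getOrCreate(topic: String, partition: Int): KafkaConsumer[String, String] = {
    val tp = new TopicPartition(topic, partition)
    cache.getOrElseUpdate(tp, {
      val consumer = newConsumer()
      consumer.assign(Collections.singletonList(tp))  // manual assignment: exactly one partition
      consumer
    })
  }
}
```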

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-20287
>                 URL: https://issues.apache.org/jira/browse/SPARK-20287
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Stephane Maarek
>
> As I understand it, and as it stands, one Kafka Consumer is created for each 
> topic partition of the source Kafka topics, and these consumers are cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumer group protocol, which 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache a Kafka consumer even if it is assigned multiple partitions.
> I'm not sure how that translates to Spark's underlying RDD architecture, but 
> from a Kafka standpoint, I believe creating one consumer per partition is a big 
> overhead, and a risk, as the user may have to increase the 
> spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss, to understand the rationale.


