[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674092#comment-15674092 ]
Ofir Manor commented on SPARK-18475:
------------------------------------

Are you sure this is working and having a visible performance effect? As far as I know, the maximum parallelism of Kafka is the number of topic-partitions, by design. If your consumer group has more consumers than that, some of them will simply be idle. This is because, when reading, each partition is owned by a single consumer (the allocation of partitions to consumers is dynamic, changing as consumers join and leave). To quote an older source:

??The first thing to understand is that a topic partition is the unit of parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. So expensive operations such as compression can utilize more hardware resources. On the consumer side, Kafka always gives a single partition’s data to one consumer thread. Thus, the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed. Therefore, in general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve.??

https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/

You could repartition the data in Spark after reading it to increase the parallelism of Spark's processing (a sketch follows the quoted issue below).

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as there are Kafka TopicPartitions to read from.
> This doesn't work well when there is data skew, and there is no reason why we shouldn't be able to increase parallelism further, i.e., have multiple Spark tasks reading from the same Kafka TopicPartition.
> This means we won't be able to use the CachedKafkaConsumer for its intended purpose (being cached) in this use case, but the extra overhead is worth it for handling data skew and increasing parallelism, especially in ETL use cases.
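Regarding the repartition workaround suggested in the comment above, here is a minimal sketch. The topic name "events", the broker address localhost:9092, and the partition count of 40 are illustrative placeholders, and the spark-sql-kafka-0-10 connector is assumed to be on the classpath:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumes the spark-sql-kafka-0-10 connector is available.
val spark = SparkSession.builder.appName("kafka-repartition").getOrCreate()

// The Kafka source creates one Spark task per TopicPartition.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder brokers
  .option("subscribe", "events")                       // placeholder topic
  .load()

// Shuffle the rows so downstream stages run with 40 tasks, regardless of
// how many TopicPartitions the topic has (40 is illustrative).
val widened = df.repartition(40)
{code}

Note the trade-off: repartition introduces a shuffle, so it helps when per-record processing is expensive relative to the shuffle cost; it does not speed up the read from Kafka itself.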
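For the improvement the quoted issue proposes (multiple Spark tasks reading the same TopicPartition), one way to picture it is splitting a TopicPartition's offset range for a batch into sub-ranges, one per task. This is only an illustrative sketch: the OffsetRange shape and the splitRange helper are hypothetical, not Spark's actual internals.

{code:scala}
// Hypothetical helper type for illustration only; not Spark internals.
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

// Split one TopicPartition's offset range into up to `parts` sub-ranges
// so that several tasks could consume it in parallel.
def splitRange(range: OffsetRange, parts: Int): Seq[OffsetRange] = {
  val total = range.untilOffset - range.fromOffset
  val step  = math.max(1L, (total + parts - 1) / parts) // ceiling division
  (range.fromOffset until range.untilOffset by step).map { start =>
    range.copy(fromOffset = start,
               untilOffset = math.min(start + step, range.untilOffset))
  }
}

// Example: a skewed partition holding offsets [0, 1000) becomes four
// sub-ranges of 250 offsets each, readable by four independent tasks.
val subRanges = splitRange(OffsetRange("events", 0, 0L, 1000L), 4)
{code}

As the issue description notes, each sub-range would need its own consumer, so the CachedKafkaConsumer could no longer be reused once per TopicPartition.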