Burak Yavuz created SPARK-18475:
-----------------------------------

             Summary: Be able to provide higher parallelization for 
StructuredStreaming Kafka Source
                 Key: SPARK-18475
                 URL: https://issues.apache.org/jira/browse/SPARK-18475
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.0.2, 2.1.0
            Reporter: Burak Yavuz


Right now the StructuredStreaming Kafka Source creates exactly one Spark task
per TopicPartition that we're going to read from Kafka.
This doesn't work well when we have data skew: a TopicPartition holding far
more data than the others is still read by a single task, which then becomes
a straggler. There is no reason why we shouldn't be able to increase
parallelism further by having multiple Spark tasks read from the same Kafka
TopicPartition.
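
As a rough illustration of the idea (not Spark code, and not a proposed
implementation), the sketch below splits the offset range of each
TopicPartition into several smaller ranges, so that each range could be
handed to its own task. The names OffsetRange, split_range, plan_tasks and
max_offsets_per_task are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical description of the work assigned to one task."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def split_range(r: OffsetRange, max_offsets_per_task: int) -> List[OffsetRange]:
    """Split one offset range into contiguous pieces of bounded size."""
    if max_offsets_per_task <= 0:
        raise ValueError("max_offsets_per_task must be positive")
    pieces = []
    start = r.from_offset
    while start < r.until_offset:
        end = min(start + max_offsets_per_task, r.until_offset)
        pieces.append(OffsetRange(r.topic, r.partition, start, end))
        start = end
    return pieces


def plan_tasks(ranges: List[OffsetRange],
               max_offsets_per_task: int) -> List[OffsetRange]:
    """Return one OffsetRange per task, splitting large ranges."""
    return [p for r in ranges for p in split_range(r, max_offsets_per_task)]


# Skewed input: partition 0 has 10x the data of partition 1.
skewed = [
    OffsetRange("events", 0, 0, 1000),
    OffsetRange("events", 1, 0, 100),
]
tasks = plan_tasks(skewed, max_offsets_per_task=250)
# Partition 0 is now covered by 4 tasks and partition 1 by 1 task,
# instead of one task each.
```

With one task per TopicPartition the example above would run as 2 tasks, one
of which does ten times the work of the other. After splitting, no task reads
more than 250 offsets.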

The trade-off is that, in this use case, we won't be able to use the
"CachedKafkaConsumer" for what it is defined for (being cached). The extra
overhead is worth paying to handle data skew and increase parallelism,
especially in ETL use cases.



