[jira] [Created] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

Burak Yavuz (JIRA) Wed, 16 Nov 2016 14:44:05 -0800

Burak Yavuz created SPARK-18475:
-----------------------------------

             Summary: Be able to provide higher parallelization for 
StructuredStreaming Kafka Source
                 Key: SPARK-18475
                 URL: https://issues.apache.org/jira/browse/SPARK-18475
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.0.2, 2.1.0
            Reporter: Burak Yavuz



Right now the Structured Streaming Kafka source creates exactly one Spark 
task for each TopicPartition that we're going to read from Kafka.
This doesn't work well when we have data skew, since a heavily loaded 
TopicPartition is still read by a single task. There is no reason why we 
shouldn't be able to increase parallelism further, i.e. have multiple Spark 
tasks reading from the same Kafka TopicPartition.

This means that we won't be able to use the "CachedKafkaConsumer" for what 
it is designed for (being cached) in this use case, but the extra overhead 
is worth paying in order to handle data skew and increase parallelism, 
especially in ETL use cases.
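
To make the proposal concrete, here is a minimal sketch of how the offset 
range of one TopicPartition could be cut into several contiguous sub-ranges, 
each of which would become its own Spark task. This is not Spark's actual 
code: OffsetRange, split_range, plan_tasks and the max_records_per_task 
parameter are hypothetical names chosen for illustration, and the sketch is 
in plain Python rather than Scala so that it runs on its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical description of the work for one task."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    @property
    def size(self) -> int:
        return self.until_offset - self.from_offset


def split_range(r: OffsetRange, max_records_per_task: int) -> List[OffsetRange]:
    """Cut one TopicPartition's range into contiguous pieces of bounded size."""
    if max_records_per_task <= 0:
        raise ValueError("max_records_per_task must be positive")
    pieces = []
    start = r.from_offset
    while start < r.until_offset:
        end = min(start + max_records_per_task, r.until_offset)
        pieces.append(OffsetRange(r.topic, r.partition, start, end))
        start = end
    return pieces


def plan_tasks(ranges: List[OffsetRange], max_records_per_task: int) -> List[OffsetRange]:
    """Skewed partitions yield several tasks; small ones still yield one."""
    return [p for r in ranges for p in split_range(r, max_records_per_task)]


if __name__ == "__main__":
    # Partition 0 holds ten times more data than partition 1 (data skew).
    skewed = [OffsetRange("events", 0, 0, 1000), OffsetRange("events", 1, 0, 100)]
    for task in plan_tasks(skewed, max_records_per_task=250):
        print(task)
```

Because several tasks would then read different offsets of the same 
TopicPartition, possibly on the same executor, a consumer cached per 
TopicPartition could no longer be assumed to be positioned at the offset a 
task needs, which is the caching overhead mentioned above.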



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
