Burak Yavuz created SPARK-18475:
-----------------------------------

             Summary: Be able to provide higher parallelization for StructuredStreaming Kafka Source
                 Key: SPARK-18475
                 URL: https://issues.apache.org/jira/browse/SPARK-18475
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.0.2, 2.1.0
            Reporter: Burak Yavuz

Right now the StructuredStreaming Kafka Source creates exactly one Spark task per TopicPartition that we're going to read from Kafka. This doesn't work well when we have data skew, because a heavily loaded TopicPartition is still read by a single task. There is no reason why we shouldn't be able to increase parallelism further, i.e. have multiple Spark tasks reading from the same Kafka TopicPartition.

The trade-off is that, in this use case, we won't be able to use the "CachedKafkaConsumer" for what it is designed for (being cached), but the extra overhead is worth it to handle data skew and increase parallelism, especially in ETL use cases.
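
To make the proposal concrete, here is a minimal sketch of how one TopicPartition's offset range could be divided among several tasks. It is an illustration only, not Spark's implementation: `OffsetRange`, `split_ranges`, and `target_tasks` are hypothetical names, and the sketch assumes each task would read a contiguous, half-open slice of offsets, with larger ranges receiving proportionally more slices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical half-open range [from_offset, until_offset) of one TopicPartition."""
    topic: str
    partition: int
    from_offset: int
    until_offset: int

    @property
    def size(self) -> int:
        return self.until_offset - self.from_offset


def split_ranges(ranges: List[OffsetRange], target_tasks: int) -> List[OffsetRange]:
    """Split ranges into roughly `target_tasks` pieces, proportional to range size.

    If `target_tasks` does not exceed the number of ranges, the input is
    returned unchanged (one task per TopicPartition, as today).
    """
    total = sum(r.size for r in ranges)
    if total == 0 or target_tasks <= len(ranges):
        return list(ranges)

    pieces: List[OffsetRange] = []
    for r in ranges:
        if r.size == 0:
            pieces.append(r)
            continue
        # Integer share of the task budget, at least 1 and at most one
        # piece per offset. Flooring means the total can fall slightly
        # short of target_tasks.
        parts = min(r.size, max(1, r.size * target_tasks // total))
        step, remainder = divmod(r.size, parts)
        start = r.from_offset
        for i in range(parts):
            # Spread the remainder over the first `remainder` pieces.
            end = start + step + (1 if i < remainder else 0)
            pieces.append(OffsetRange(r.topic, r.partition, start, end))
            start = end
    return pieces


if __name__ == "__main__":
    skewed = [
        OffsetRange("events", 0, 0, 900),  # hot partition
        OffsetRange("events", 1, 0, 100),  # quiet partition
    ]
    for piece in split_ranges(skewed, target_tasks=10):
        print(piece)
```

In this sketch the hot partition is read by nine tasks and the quiet one by a single task. Because several tasks would then read the same TopicPartition concurrently, a consumer cached per TopicPartition could not simply be shared between them, which is the overhead mentioned above.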