[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15682171#comment-15682171 ]
Cody Koeninger commented on SPARK-18475:
----------------------------------------

An iterator certainly does have an ordering guarantee, and it's pretty straightforward to figure out whether a given operation shuffles. Plenty of jobs have been written depending on that ordering guarantee, and it's documented for the Direct Stream.

The only reason this is a significant performance improvement is that the OP is misusing Kafka. If he had reasonably even production into a reasonable number of partitions, there would be no performance improvement.

You guys might be able to convince Michael this is a good idea, but as I said, this isn't the first time this has come up, and my answer isn't likely to change. I'm not "blocking" anything; I'm not a gatekeeper and have no more rights than you do. I just think it's a really bad idea.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>      Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we
> shouldn't be able to increase parallelism further, i.e. have multiple Spark
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer"
> for what it is defined for (being cached) in this use case, but the extra
> overhead is worth handling data skew and increasing parallelism especially in
> ETL use cases.
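The idea under debate, letting several tasks read one skewed TopicPartition, amounts to splitting that partition's offset range into sub-ranges. The following is an illustrative sketch only, not Spark's actual implementation; the function name and parameters are hypothetical. It also makes the trade-off concrete: once a partition is split, no single task iterates the whole partition in order, which is the ordering guarantee the comment above is defending.

```python
def split_offset_range(start_offset, end_offset, max_records_per_task):
    """Split a Kafka [start, end) offset range into sub-ranges so a skewed
    TopicPartition could be read by several tasks in parallel.

    Trade-off: consumers of the sub-ranges lose the single-iterator,
    in-order view of the partition that one task per partition provides,
    and each sub-range needs its own consumer (defeating caching).
    """
    if max_records_per_task <= 0:
        raise ValueError("max_records_per_task must be positive")
    ranges = []
    offset = start_offset
    while offset < end_offset:
        ranges.append((offset, min(offset + max_records_per_task, end_offset)))
        offset += max_records_per_task
    return ranges

# A partition holding offsets [0, 10) capped at 4 records per task
# yields three tasks: [(0, 4), (4, 8), (8, 10)]
```

Note that with evenly produced data and enough partitions, every range is already small, so splitting buys nothing, which is the performance point made above.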
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)