[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15682171#comment-15682171 ]
Cody Koeninger commented on SPARK-18475:
----------------------------------------

An iterator certainly does have an ordering guarantee, and it's pretty straightforward to figure out whether a given operation shuffles. Plenty of jobs have been written depending on that ordering guarantee, and it's documented for the Direct Stream.

The only reason this is a significant performance improvement is that the OP is misusing Kafka. If he had reasonably even production into a reasonable number of partitions, there would be no performance improvement.

You guys might be able to convince Michael this is a good idea, but as I said, this isn't the first time this has come up, and my answer isn't likely to change. I'm not "blocking" anything; I'm not a gatekeeper and have no more rights than you do. I just think it's a really bad idea.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>      Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we
> shouldn't be able to increase parallelism further, i.e. have multiple Spark
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer"
> for what it is defined for (being cached) in this use case, but the extra
> overhead is worth handling data skew and increasing parallelism especially in
> ETL use cases.
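The idea under debate, letting several tasks read one skewed TopicPartition, amounts to splitting that partition's offset range into sub-ranges. The following is an illustrative sketch only, not Spark's actual implementation; the function name and parameters are hypothetical. It also makes the trade-off concrete: once a partition is split, no single task iterates the whole partition in order, which is the ordering guarantee the comment above is defending.

```python
def split_offset_range(start_offset, end_offset, max_records_per_task):
    """Split a Kafka [start, end) offset range into sub-ranges so a skewed
    TopicPartition could be read by several tasks in parallel.

    Trade-off: consumers of the sub-ranges lose the single-iterator,
    in-order view of the partition that one task per partition provides,
    and each sub-range needs its own consumer (defeating caching).
    """
    if max_records_per_task <= 0:
        raise ValueError("max_records_per_task must be positive")
    ranges = []
    offset = start_offset
    while offset < end_offset:
        ranges.append((offset, min(offset + max_records_per_task, end_offset)))
        offset += max_records_per_task
    return ranges

# A partition holding offsets [0, 10) capped at 4 records per task
# yields three tasks: [(0, 4), (4, 8), (8, 10)]
```

Note that with evenly produced data and enough partitions, every range is already small, so splitting buys nothing, which is the performance point made above.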
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)