GitHub user brkyvz opened a pull request: https://github.com/apache/spark/pull/15909
[SPARK-18475] Be able to increase parallelism in StructuredStreaming Kafka source

## What changes were proposed in this pull request?

This PR adds the configuration `numPartitions` to the StructuredStreaming Kafka Source. Setting it higher than the number of `TopicPartitions` you are going to consume allows Spark to run multiple tasks reading from the same `TopicPartition`, which lets users handle skewed partitions.

The number of `TopicPartitions` can change from batch to batch (e.g. you may delete or create topics), but in ETL use cases, where you generally have a static set of `TopicPartitions`, this configuration has been very useful. If the `TopicPartitions` are dynamic, then we will always have a parallelism of `max(topicPartitions.length, numPartitions)`.

## How was this patch tested?

Unit tests. I used this on production data and it certainly helped in handling peak loads and skewed partitions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark split-kafka-partitions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15909.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15909

----
commit f07bc5fbec10f4df2c990b97ae69f84d485a03b7
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-15T20:02:24Z

    kafka one-to-many mapping stop re-using consumers Close connections

commit 99e94d9676f382c585f6bfd5b1b2f853020d27c0
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-15T23:40:31Z

    Merge branch 'master' of github.com:apache/spark into split-kafka-partitions

commit 7e812ff68f6e89a66fa1ea2e7feba09892bd548b
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-16T01:01:01Z

    ready for testing

commit 3d0847d8bd4b9d62e79ae17b29b37de34dcfae62
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-16T21:27:24Z

    ready for review

----
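
For readers who want a concrete picture of the one-to-many mapping described in the PR description, here is a minimal Python sketch of how a batch's offset ranges might be divided so that the task count equals `max(topicPartitions.length, numPartitions)`. It is not taken from the patch: the function `split_offset_ranges`, the tuple-based types, and the greedy "give the next task to the most loaded `TopicPartition`" policy are all assumptions made for illustration, and the actual implementation in the PR may split ranges differently.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical stand-ins for illustration only; these are not the PR's classes.
TopicPartition = Tuple[str, int]               # (topic, partition)
OffsetRange = Tuple[TopicPartition, int, int]  # (tp, from_offset, until_offset), until is exclusive


def split_offset_ranges(
    ranges: Dict[TopicPartition, Tuple[int, int]],
    num_partitions: int,
) -> List[OffsetRange]:
    """Split per-TopicPartition offset ranges into task-sized sub-ranges.

    The number of sub-ranges returned is max(len(ranges), num_partitions),
    matching the parallelism stated in the PR description.
    """
    if not ranges:
        return []

    # Every TopicPartition gets at least one task.
    slices = {tp: 1 for tp in ranges}
    extra = max(len(ranges), num_partitions) - len(ranges)

    # Max-heap (via negated keys) on "records per task" for each TopicPartition.
    heap = [(-(until - start), tp) for tp, (start, until) in ranges.items()]
    heapq.heapify(heap)

    # Assumed policy: hand each extra task to the currently most loaded TopicPartition.
    for _ in range(extra):
        _, tp = heapq.heappop(heap)
        slices[tp] += 1
        start, until = ranges[tp]
        heapq.heappush(heap, (-(until - start) / slices[tp], tp))

    # Cut each range into contiguous, near-equal sub-ranges.
    result: List[OffsetRange] = []
    for tp in sorted(ranges):
        start, until = ranges[tp]
        size, n = until - start, slices[tp]
        for i in range(n):
            lo = start + size * i // n
            hi = start + size * (i + 1) // n
            result.append((tp, lo, hi))
    return result


if __name__ == "__main__":
    # A skewed batch: partition 0 holds ten times as many records as partition 1.
    batch = {("events", 0): (0, 1000), ("events", 1): (0, 100)}
    for tp, lo, hi in split_offset_ranges(batch, num_partitions=4):
        print(tp, lo, hi)
```

In this example the skewed partition `("events", 0)` is read by three tasks and `("events", 1)` by one, for four tasks in total. With `num_partitions=1` the same batch still yields two tasks, one per `TopicPartition`, which is the `max(...)` behaviour the description states.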