GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/15909

    [SPARK-18475] Be able to increase parallelism in StructuredStreaming Kafka 
source

    ## What changes were proposed in this pull request?
    
    This PR adds a `numPartitions` option to the StructuredStreaming Kafka 
Source. Setting it higher than the number of `TopicPartitions` being consumed 
lets Spark run multiple tasks against the same `TopicPartition`, which helps 
users handle skewed partitions.
    
    The number of `TopicPartitions` can change from batch to batch, e.g. when 
topics are created or deleted. In ETL use cases, where the set of 
`TopicPartitions` is generally static, this configuration has been very 
useful.
    
    If the `TopicPartitions` are dynamic, then we will always have a 
parallelism of `max(topicPartitions.length, numPartitions)`.
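
As a rough illustration of the one-to-many mapping (a standalone sketch, not the code in this PR), the snippet below plans tasks for one batch. The function name `plan_tasks`, the `(topic_partition, from_offset, until_offset)` tuple shape, and the "give extra tasks to the largest range" heuristic are all assumptions made for the example; the only parts taken from the description above are the `numPartitions` setting and the resulting parallelism of `max(topicPartitions.length, numPartitions)`.

```python
from typing import Dict, List, Tuple

# (topic_partition, from_offset, until_offset); hypothetical shape for this sketch
OffsetRange = Tuple[str, int, int]


def plan_tasks(ranges: Dict[str, Tuple[int, int]], num_partitions: int) -> List[OffsetRange]:
    """Split per-TopicPartition offset ranges into max(len(ranges), num_partitions) tasks."""
    if not ranges:
        return []

    # Every TopicPartition gets at least one task, so parallelism never
    # drops below the number of TopicPartitions.
    splits = {tp: 1 for tp in ranges}

    # Assumed heuristic: hand out the remaining tasks one at a time to the
    # TopicPartition that currently has the most records per task.
    for _ in range(max(0, num_partitions - len(ranges))):
        busiest = max(ranges, key=lambda tp: (ranges[tp][1] - ranges[tp][0]) / splits[tp])
        splits[busiest] += 1

    tasks: List[OffsetRange] = []
    for tp, (start, end) in ranges.items():
        n = splits[tp]
        size = end - start
        # Cut [start, end) into n contiguous, non-overlapping sub-ranges.
        for i in range(n):
            lo = start + size * i // n
            hi = start + size * (i + 1) // n
            tasks.append((tp, lo, hi))
    return tasks


# A skewed batch: "events-0" holds 900 records, "events-1" holds 100.
tasks = plan_tasks({"events-0": (0, 900), "events-1": (0, 100)}, num_partitions=4)
```

With these inputs the skewed partition `events-0` is read by three tasks of 300 offsets each and `events-1` by a single task, for four tasks in total. If `num_partitions` is smaller than the number of `TopicPartitions`, the plan falls back to one task per `TopicPartition`.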
    
    ## How was this patch tested?
    
    Unit tests. I used this on production data and it certainly helped in 
handling peak loads and skewed partitions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark split-kafka-partitions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15909.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15909
    
----
commit f07bc5fbec10f4df2c990b97ae69f84d485a03b7
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-15T20:02:24Z

    kafka one-to-many mapping
    
    stop re-using consumers
    
    Close connections

commit 99e94d9676f382c585f6bfd5b1b2f853020d27c0
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-15T23:40:31Z

    Merge branch 'master' of github.com:apache/spark into split-kafka-partitions

commit 7e812ff68f6e89a66fa1ea2e7feba09892bd548b
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-16T01:01:01Z

    ready for testing

commit 3d0847d8bd4b9d62e79ae17b29b37de34dcfae62
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-11-16T21:27:24Z

    ready for review

----

