[GitHub] spark pull request #15367: [SPARK-17346][SQL] Add Kafka source for Structure...

zsxwing Wed, 05 Oct 2016 17:03:45 -0700

GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/15367


    [SPARK-17346][SQL] Add Kafka source for Structured Streaming (branch 2.0)

    ## What changes were proposed in this pull request?
    
    Backport 
https://github.com/apache/spark/commit/9293734d35eb3d6e4fd4ebb86f54dd5d3a35e6db 
into branch 2.0.
    
    The only difference is the Spark version in pom file.
    
    ## How was this patch tested?
    
    Jenkins.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark kafka-source-branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15367
    
----
commit 17e6b5c3fb702e66748634260670f4f843aa38fe
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-10-05T23:45:45Z

    [SPARK-17346][SQL] Add Kafka source for Structured Streaming
    
    ## What changes were proposed in this pull request?
    
    This PR adds a new project ` external/kafka-0-10-sql` for Structured 
Streaming Kafka source.
    
    It's based on the design doc: 
https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
    
    tdas did most of work and part of them was inspired by koeninger's work.
    
    ### Introduction
    
    The Kafka source is a structured streaming data source to poll data from 
Kafka. The schema of reading data is as follows:
    
    Column | Type
    ---- | ----
    key | binary
    value | binary
    topic | string
    partition | int
    offset | long
    timestamp | long
    timestampType | int
    
    The source can deal with deleting topics. However, the user should make 
sure there is no Spark job processing the data when deleting a topic.
    
    ### Configuration
    
    The user can use `DataStreamReader.option` to set the following 
configurations.
    
    Kafka Source's options | value | default | meaning
    ------ | ------- | ------ | -----
    startingOffset | ["earliest", "latest"] | "latest" | The start point when a 
query is started, either "earliest" which is from the earliest offset, or 
"latest" which is just from the latest offset. Note: This only applies when a 
new Streaming query is started, and that resuming will always pick up from 
where the query left off.
    failOnDataLost | [true, false] | true | Whether to fail the query when it's 
possible that data is lost (e.g., topics are deleted, or offsets are out of 
range). This may be a false alarm. You can disable it when it doesn't work as 
you expected.
    subscribe | A comma-separated list of topics | (none) | The topic list to 
subscribe. Only one of "subscribe" and "subscribeParttern" options can be 
specified for Kafka source.
    subscribePattern | Java regex string | (none) | The pattern used to 
subscribe the topic. Only one of "subscribe" and "subscribeParttern" options 
can be specified for Kafka source.
    kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to 
poll data from Kafka in executors
    fetchOffset.numRetries | int | 3 | Number of times to retry before giving 
up fatch Kafka latest offsets.
    fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before 
retrying to fetch Kafka offsets
    
    Kafka's own configurations can be set via `DataStreamReader.option` with 
`kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`
    
    ### Usage
    
    * Subscribe to 1 topic
    ```Scala
    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:port")
      .option("subscribe", "topic1")
      .load()
    ```
    
    * Subscribe to multiple topics
    ```Scala
    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:port")
      .option("subscribe", "topic1,topic2")
      .load()
    ```
    
    * Subscribe to a pattern
    ```Scala
    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:port")
      .option("subscribePattern", "topic.*")
      .load()
    ```
    
    ## How was this patch tested?
    
    The new unit tests.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    Author: Shixiong Zhu <zsxw...@gmail.com>
    Author: cody koeninger <c...@koeninger.org>
    
    Closes #15102 from zsxwing/kafka-source.

commit a560190dba618698ae2d9728e8b35c5954772467
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-10-06T00:02:12Z

    Fix version

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15367: [SPARK-17346][SQL] Add Kafka source for Structure...

Reply via email to