GitHub user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17246#discussion_r106798265
  
    --- Diff: docs/structured-streaming-kafka-integration.md ---
    @@ -373,11 +374,213 @@ The following configurations are optional:
     </tr>
     </table>
     
    +## Writing Data to Kafka
    +
    +Here, we describe the support for writing Streaming Queries and Batch Queries to Apache Kafka. Take note that
    +Apache Kafka only supports at-least-once write semantics. Consequently, when writing either Streaming Queries
    +or Batch Queries to Kafka, some records may be duplicated; this can happen, for example, if Kafka needs
    +to retry a message that was not acknowledged by a Broker, even though that Broker received and wrote the message record.
    +Structured Streaming cannot prevent such duplicates from occurring due to these Kafka write semantics. However,
    +if writing the query succeeds, then you can assume that the query output was written at least once. A possible
    +solution to remove duplicates when reading the written data is to introduce a primary (unique) key
    +that can be used to perform de-duplication when reading.
    +
    +Each row being written to Kafka has the following schema:
    --- End diff --
    
    The DataFrame being written to Kafka should have the following columns in the schema.

