[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553613#comment-15553613 ]
Ofir Manor commented on SPARK-15406: ------------------------------------ I see three somewhat-related issues: 1. More flexibility is generally needed in stating offsets when using a Kafka source - that moved to SPARK-17812. 2. Regarding exactly-once, what I'm missing (given a transactional / idempotent sink) is the ability to restart a failed structured streaming job (after it crashed or was killed) by submitting a new one and asking to "copy" the context (offset and internal state) of another job / query. The programing guide hints that it is possible, but doesn't give an example (and I couldn't find a relevant method): "In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off". If that was possible, it would remove the need to manually monitor / manage source offsets for exactly-once. However, this is not specific to Kafka source - it is relevant for all fault-tolerant sources. 3. Another related issue (though out-of-scope of this JIRA) is adding an "exactly-once" Kafka sink. Since in Kafka we can't commit a batch of messages together (like in a Foreach sink), the open of [version,partition] can't just return a boolean - it should likely return the messages within a [version,partition] that were already written so only those will be filtered out (or otherwise filter within the partition before process() is called). That again is not Kafka-specific, will be useful in other non-transactional sinks, > Structured streaming support for consuming from Kafka > ----------------------------------------------------- > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature > Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > ================== Old description ========================= > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org