[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553852#comment-15553852 ]
Michael Armbrust commented on SPARK-15406:
------------------------------------------

Restarting a query is just running that query again while passing in the same checkpoint location. If the checkpoint location is empty, we start a new query (using the startingoffset parameter). If not, we resume the query where we left off.

Regarding exactly-once to a system like kafka, I'm not sure that is possible in an efficient way (though please correct me if I'm missing something). At some point you try to send a message, and if something fails while that is happening (possibly because of some external event like the whole JVM dying), you can't know whether it was received without scanning through the log.

As such, my thoughts on the kafka sink are to come up with a primary key, ensure at-least-once delivery, and then rely on the downstream to dedup. The kafka docs cover this pretty well in their [semantics section|http://kafka.apache.org/documentation.html#semantics]. Incidentally, that section also explains how we implement exactly-once in our file sink.

> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
>                 Key: SPARK-15406
>                 URL: https://issues.apache.org/jira/browse/SPARK-15406
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source for Structured Streaming. Here is the design doc for an initial version of the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for kafka yet.
> I personally feel like time based indexing would make for a much better interface, but it's been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
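
For illustration only, here is a minimal Python sketch of the approach described in the comment above: attach a primary key to each record, retry sends until they are acknowledged (at-least-once), and let the downstream consumer dedup on that key. Nothing here is Spark or Kafka API. `FlakyLog`, `send_at_least_once`, `dedup_by_key`, and the `batch0-rowN` key scheme are hypothetical names invented for this sketch, and the toy log always stores the record and loses only the acknowledgement, which is a simplification of the real failure modes.

```python
class FlakyLog:
    """Toy stand-in for a Kafka topic (hypothetical, not a Kafka client).

    Every send is appended, but the ack for the first `lost_acks` attempts
    per key is dropped, so the producer cannot tell the write succeeded.
    """

    def __init__(self, lost_acks=1):
        self.records = []
        self._lost_acks = lost_acks
        self._attempts = {}

    def send(self, key, value):
        self.records.append((key, value))
        self._attempts[key] = self._attempts.get(key, 0) + 1
        if self._attempts[key] <= self._lost_acks:
            raise TimeoutError("ack lost for %s" % key)


def send_at_least_once(log, key, value, max_attempts=5):
    """Retry until an ack arrives; this may leave duplicates in the log."""
    for _ in range(max_attempts):
        try:
            log.send(key, value)
            return
        except TimeoutError:
            continue
    raise RuntimeError("gave up on %s after %d attempts" % (key, max_attempts))


def dedup_by_key(records):
    """Downstream side: keep only the first record seen for each primary key."""
    seen = set()
    unique = []
    for key, value in records:
        if key not in seen:
            seen.add(key)
            unique.append((key, value))
    return unique


# Assumed key scheme: batch id plus row index, so a retried row
# always carries the same primary key.
rows = [("batch0-row%d" % i, i * 10) for i in range(5)]

log = FlakyLog(lost_acks=1)
for key, value in rows:
    send_at_least_once(log, key, value)

deduped = dedup_by_key(log.records)
```

In this toy run each row is written twice because its first ack is lost, and the downstream dedup recovers exactly the original rows. A real consumer would need to bound the `seen` set, for example by keeping keys only for a limited window.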