[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

Michael Armbrust (JIRA) Thu, 06 Oct 2016 19:00:59 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553852#comment-15553852
 ]


Michael Armbrust commented on SPARK-15406:
------------------------------------------

Restarting a query is just running that query again, while passing in the same 
checkpoint location.  If the checkpoint location is empty, we start a new query 
(using the startingoffset parameter).  If not, we resume the query where we 
left off.
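
As a rough illustration of that restart rule, here is a toy sketch in plain Python. It is not Spark code: `run_query`, the `offset.json` file, and the `starting_offset` and `max_records` parameters are all hypothetical names made up for this example, and a real checkpoint holds much more than a single offset. It only shows the decision between "empty checkpoint, use the starting offset" and "existing checkpoint, resume where we left off".

```python
import json
import os
import tempfile


def run_query(records, checkpoint_dir, starting_offset=0, max_records=None):
    """Process records from the checkpointed offset if one exists,
    otherwise from starting_offset. Returns the records processed in this run."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    offset_file = os.path.join(checkpoint_dir, "offset.json")

    if os.path.exists(offset_file):
        # Non-empty checkpoint: resume, ignoring starting_offset.
        with open(offset_file) as f:
            offset = json.load(f)["next_offset"]
    else:
        # Empty checkpoint: this is a new query.
        offset = starting_offset

    end = len(records)
    if max_records is not None:
        end = min(end, offset + max_records)

    processed = []
    for i in range(offset, end):
        processed.append(records[i])
        # Record progress after each record so a later run can resume here.
        with open(offset_file, "w") as f:
            json.dump({"next_offset": i + 1}, f)
    return processed


checkpoint = tempfile.mkdtemp()
records = ["a", "b", "c", "d", "e"]

# First run: the checkpoint is empty, so starting_offset applies.
first = run_query(records, checkpoint, starting_offset=1, max_records=2)

# "Restart": same query, same checkpoint location; starting_offset is ignored.
second = run_query(records, checkpoint, starting_offset=0)
```

Passing a different, empty directory as `checkpoint_dir` would start a fresh query from `starting_offset` again.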

Regarding exactly-once to a system like Kafka, I'm not sure that is possible in 
an efficient way (though please correct me if I'm missing something).  It seems 
like at some point you try to send a message, and if that fails partway through 
(possibly because of something external, like the whole JVM dying), you can't 
know whether it was received without scanning through the log.  As such, my 
thoughts on the Kafka sink are to come up with a primary key, ensure 
at-least-once delivery, and then rely on the downstream to dedup.  The Kafka 
docs cover this pretty well in their [semantics 
section|http://kafka.apache.org/documentation.html#semantics].  Incidentally, 
that section also explains how we implement exactly-once in our file sink.
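
To make the "primary key plus at-least-once plus downstream dedup" idea concrete, here is a small simulation in plain Python. Nothing in it is a Kafka or Spark API: `FlakyLog`, `send_at_least_once`, `dedup`, and the `batch0-N` key format are hypothetical stand-ins. The log accepts every write but sometimes loses the acknowledgement, which is the case where the sender cannot tell whether the message arrived.

```python
class FlakyLog:
    """Append-only log standing in for a topic. On the chosen attempt
    numbers the write succeeds but the acknowledgement is lost."""

    def __init__(self, lose_ack_on):
        self.entries = []
        self.attempts = 0
        self.lose_ack_on = set(lose_ack_on)

    def send(self, key, value):
        self.attempts += 1
        self.entries.append((key, value))  # the write itself lands
        if self.attempts in self.lose_ack_on:
            raise ConnectionError("ack lost; sender cannot tell if write landed")


def send_at_least_once(log, key, value, max_attempts=5):
    """Retry until a send is acknowledged. This may write duplicates."""
    for _ in range(max_attempts):
        try:
            log.send(key, value)
            return
        except ConnectionError:
            continue
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def dedup(entries):
    """Downstream consumer: keep the first entry seen for each primary key."""
    seen = set()
    unique = []
    for key, value in entries:
        if key not in seen:
            seen.add(key)
            unique.append((key, value))
    return unique


log = FlakyLog(lose_ack_on={2})
for key, value in [("batch0-0", "x"), ("batch0-1", "y"), ("batch0-2", "z")]:
    send_at_least_once(log, key, value)

raw = list(log.entries)   # contains a duplicate of ("batch0-1", "y")
clean = dedup(raw)        # one entry per primary key
```

The key has to be deterministic across retries and restarts (here it is derived from a batch and position), otherwise the downstream has nothing to dedup on.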

> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
>                 Key: SPARK-15406
>                 URL: https://issues.apache.org/jira/browse/SPARK-15406
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured Streaming doesn't have support for Kafka yet.  I personally feel 
> like time-based indexing would make for a much better interface, but it's 
> been pushed back to Kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
