[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

Cody Koeninger (JIRA) Thu, 06 Oct 2016 20:09:34 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553955#comment-15553955
 ]


Cody Koeninger commented on SPARK-15406:
----------------------------------------

As soon as you say "checkpoint", a whole class of people who were burned by the 
problems with DStream checkpoints are going to tune out.  I don't trust 
anything unless I can get my hands on the offsets.

Regarding exactly once, this has two components, producing idempotently to 
kafka, and storing idempotently/transactionally downstream.  Specifically 
regarding producing to Kafka, I was talking with Jay Kreps about this 
yesterday.  There have been a lot of ideas over the years (e.g. 
put-with-expected-offset).  The top contender at this point is apparently 
something along the lines of a tcp/ip sequence number on the producer side 
allowing for repeated writes to be ignored on the broker side.  The KIP doc for 
it should be coming "soon".  Specifically regarding downstream write, without 
an actual implementation for real downstream systems and no way to get offsets 
except for individual messages, what exists currently couldn't even be tested 
in production at my current gig.  I could probably pretty quickly write a jdbc 
sink and a Citus sink, but those seem like they would require significant 
knowledge and / or assumptions about the job.  Having a schema helps a lot, but 
do you still want to assume things like a particular table format for outputs?  
That those sinks only work with Kafka?

> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
>                 Key: SPARK-15406
>                 URL: https://issues.apache.org/jira/browse/SPARK-15406
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

Reply via email to