[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567367#comment-15567367 ]

Jeremy Smith edited comment on SPARK-17344 at 10/12/16 2:56 AM:
----------------------------------------------------------------

{quote}
By contrast, writing a streaming source shim around the existing simple 
consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't have 
stuff like SSL, dynamic topics, or offset committing.
{quote}

Serious question: would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
Structured Streaming already provides a lot of the capabilities that I, for 
one, care about when using Kafka: offsets are tracked natively by SS, so offset 
committing isn't that big a deal; in a CDH cluster specifically, you are 
probably relying on network-level security and don't view the lack of SSL as a 
blocker; and you're already resigned to static topic subscriptions, because 
that's what you get with the DStream API.
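
To make that concrete, here's roughly the shape of what I'm imagining. This is 
only a sketch: {{kafka08}} is a made-up format name for the proposed 
simple-consumer source, and the option keys just mirror the existing 0.10 
source.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Kafka08SourceSketch").getOrCreate()

// "kafka08" is hypothetical; the option keys mirror the 0.10 Kafka source.
val stream = spark.readStream
  .format("kafka08")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

// Structured Streaming persists per-source offsets in the checkpoint
// directory itself, so the application never has to commit offsets back
// to Kafka or ZooKeeper.
stream.writeStream
  .format("parquet")
  .option("path", "/data/events")
  .option("checkpointLocation", "/checkpoints/events")
  .start()
  .awaitTermination()
{code}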

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same degree, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently you have to, because the entire graph is 
stored in the checkpoints along with every topic you're ever going to look at. 
Structured Streaming at least gives you separate checkpoints per source, rather 
than one for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind 
to the beginning, throw away everything from the past, or (as before) rely on 
the incredibly fragile StreamingContext checkpoints. Or you can commit the 
topic/partition/offset alongside each record to the sink, so you can recover 
the set of actually-processed messages from there (see the sketch after this 
list). Again, decoupling each operation from the entire state of the 
StreamingContext is a huge step up, because you can actually upgrade your 
application code (at least in certain ways) without worrying about 
re-processing data after discarding the checkpoints.
* It will dramatically simplify the use of Kafka from Spark in general. Nine 
out of ten use cases involve some sort of structured data, and processing it 
through Tungsten performs dramatically better than doing so with RDD-level 
operations.
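
On the second point, here's the kind of thing I mean, continuing the sketch 
above with the same caveats (the column names simply follow the 0.10 source's 
schema, which a 0.8 source could expose as well):

{code:scala}
import org.apache.spark.sql.functions.col

// Carry the Kafka coordinates of each record through to the sink.
val withCoordinates = stream.select(
  col("topic"),
  col("partition"),
  col("offset"),
  col("value").cast("string").as("payload"))

// Because every output row records where it came from, the sink itself
// tells you exactly which messages were processed; recovery can be driven
// from the sink instead of from committed offsets.
withCoordinates.writeStream
  .format("parquet")
  .option("path", "/data/events-with-offsets")
  .option("checkpointLocation", "/checkpoints/events-with-offsets")
  .start()
{code}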

So if the simple-consumer-based Kafka source would be so easy, at the expense 
of some features, why not introduce it? I have a tremendous amount of respect 
for the complexity of Kafka and the work you're doing with it, but I also get 
the sense that the conceptual "perfect" here is the enemy of the good. The 
weekend project you mentioned would dramatically improve the experience for a 
large percentage of users who currently run Spark and Kafka together. Most 
companies use some kind of Hadoop distribution (e.g. HDP or CDH), and those are 
slow to update things like Kafka. HDP does ship 0.10 (CDH doesn't), but how 
quickly are people actually able to upgrade HDP? I don't have any data on it 
(ironically), but I'm guessing that 0.9 still represents a fairly significant 
portion of the Kafka install base.

Just my two cents on the matter.



> Kafka 0.8 support for Structured Streaming
> ------------------------------------------
>
>                 Key: SPARK-17344
>                 URL: https://issues.apache.org/jira/browse/SPARK-17344
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Streaming
>            Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.


