[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567367#comment-15567367 ]
Jeremy Smith commented on SPARK-17344:
--------------------------------------

> By contrast, writing a streaming source shim around the existing simple
> consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't
> have stuff like SSL, dynamic topics, or offset committing.

Serious question: would it be so bad to have a bifurcated codebase here? People who are tied to Kafka 0.8/0.9 typically know that this is a limitation for them, and are probably not all that concerned about the features you mentioned. In general, Structured Streaming already provides many of the capabilities that I, for one, am concerned about when using Kafka: offsets are tracked natively by SS, so offset committing isn't that big of a deal; on a CDH cluster specifically, you are probably using network-level security and don't view the lack of SSL as a blocker; and finally, you're already resigned to static topic subscriptions, because that's what you get with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same degree, but at least you won't have to throw away all your checkpoints just to do something with a new topic in the same application. Currently you have to, because the entire graph is stored in the checkpoints along with all the topics you're ever going to look at. Structured Streaming at least gives you separate checkpoints per source, rather than one for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind to the beginning, throw away everything from the past, or (as before) rely on the incredibly fragile StreamingContext checkpoints. Or commit the topic/partition/offset to the sink, so you can recover the actually-processed messages from there.
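That last idea might look something like the sketch below. This is purely hypothetical: the source name "kafka08" is made up (no such Structured Streaming source exists for 0.8), and the metadata columns (topic, partition, offset) are assumptions modeled on what such a source could plausibly expose. It just illustrates carrying Kafka provenance through to the sink so the last processed offsets are recoverable from the sink itself:

```scala
import org.apache.spark.sql.SparkSession

object Kafka08SourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka08-sketch").getOrCreate()

    // "kafka08" is a hypothetical 0.8-based source; "subscribe" is a
    // static topic subscription, just as with the DStream API.
    val stream = spark.readStream
      .format("kafka08")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Carry topic/partition/offset alongside the payload so the sink
    // records exactly which messages were actually processed.
    val withProvenance = stream.selectExpr(
      "CAST(value AS STRING) AS payload", "topic", "partition", "offset")

    withProvenance.writeStream
      .format("parquet")
      .option("path", "/data/events")
      .option("checkpointLocation", "/checkpoints/events") // per-query, not per-context
      .start()
      .awaitTermination()
  }
}
```

On recovery, a simple `SELECT topic, partition, max(offset) ... GROUP BY topic, partition` over the sink tells you where processing actually got to, independent of any Spark checkpoint state.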
Again, decoupling each operation from the entire state of the StreamingContext is a huge step up, because you can actually upgrade your application code (at least in certain ways) without having to worry about re-processing data after discarding the checkpoints.

* It will dramatically simplify the use of Kafka from Spark in general. Nine out of ten use cases involve some sort of structured data, and processing it through Tungsten performs dramatically better than RDD-level operations.

So if the simple-consumer-based Kafka source would be so easy, at the expense of some features, why not introduce it? I have a tremendous amount of respect for the complexity of Kafka and the work you're doing with it, but I also get the sense that the conceptual "perfect" here is the enemy of the good. The weekend project you mentioned would dramatically improve the experience for a large percentage of users who are currently using Spark and Kafka together. Most companies use some kind of Hadoop distribution (e.g. HDP or CDH), and they are slow to update things like Kafka. HDP does have 0.10 (CDH doesn't), but at what rate are people actually able to update HDP? I don't have any data on that (ironically), but I'm guessing 0.9 still represents a fairly significant portion of the Kafka install base.

Just my two cents on the matter.

> Kafka 0.8 support for Structured Streaming
> ------------------------------------------
>
>          Key: SPARK-17344
>          URL: https://issues.apache.org/jira/browse/SPARK-17344
>      Project: Spark
>   Issue Type: Sub-task
>   Components: Streaming
>     Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured
> Streaming.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)