[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109056#comment-15109056 ]

Mario Briggs edited comment on SPARK-12177 at 1/20/16 6:16 PM:
---------------------------------------------------------------

bq. I totally understand what you mean. However, kafka has its own assembly in 
Spark and the way the code is structured right now, both the new API and old 
API would go in the same assembly so it's important to have a different package 
name. Also, I think for our end users transitioning from old to new API, I 
foresee them having 2 versions of their spark-kafka app. One that works with 
the old API and one with the new API. And, I think it would be an easier 
transition if they could include both the kafka API versions in the spark 
classpath and pick and choose which app to run without mucking with maven 
dependencies and re-compiling when they want to switch. Let me know if you 
disagree.

Hey Mark,
Since this is WIP, I myself have at least one more option besides the one I 
suggested above... I put up just one to get the conversation rolling, so thanks 
for chipping in. Thanks for bringing up the kafka-assembly part; the assembly 
jar includes all dependencies too, so we would be pulling in Kafka v0.9's jars, 
I guess? I remember Cody mentioning that a v0.8 to v0.9 upgrade on the Kafka 
side involves upgrading the brokers; I think that is not required when a client 
uses the v0.9 jar but consumes only the older high-level/low-level API and 
talks to a v0.8 Kafka cluster. So we could go with a single kafka-assembly, but 
then my suggestion above is itself broken, since we would have clashes between 
identical package and class names.
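
To make the name-clash point concrete, here is a toy Scala sketch (package and 
object names are made up for illustration, not the actual layout): if both 
subprojects kept identical fully qualified class names, only one copy would 
survive in a single merged assembly, whereas distinct names let the two coexist 
on one classpath.

{code:scala}
// Toy sketch, hypothetical names only: two integrations can sit in one jar
// because their fully qualified names differ. If both were named
// example.kafka.Consumer, the assembly merge would keep only one of them.
package example.kafka08 {
  object Consumer {
    def describe(): String = "0.8-style direct stream (old simple consumer)"
  }
}

package example.kafka09 {
  object Consumer {
    def describe(): String = "0.9-style direct stream (new consumer API)"
  }
}

object PickOne extends App {
  // both are on the classpath; the application decides which one it uses
  println(example.kafka08.Consumer.describe())
  println(example.kafka09.Consumer.describe())
}
{code}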

One thought behind not putting the version in the package name or class name 
(I see that Flink does it in the class name) was to avoid forcing us to create 
v0.10/v0.11 packages (and forcing customers to change code and recompile), even 
if those Kafka releases have no client-API or other changes that warrant a new 
version. (Also, Spark has gotten away without a version number until now, which 
means less work in Spark, so I am not sure we want to start forcing this work 
going forward.) Once we introduce the version number, we need to ensure it 
stays in sync with Kafka.
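
For contrast, here is a rough sketch of the versioned-name style mentioned 
above (the class names below are placeholders, not Flink's or Spark's real 
API): once the version is baked into the type name, every Kafka bump surfaces 
as a new type, so user code has to change and recompile even when the consumer 
API underneath is unchanged.

{code:scala}
// Placeholder classes, for illustration only: the Kafka version is part of
// the type name itself.
class VersionedKafkaStream08(val topics: Set[String])
class VersionedKafkaStream09(val topics: Set[String])

object VersionInNameExample extends App {
  // user code originally written against Kafka 0.8
  val before = new VersionedKafkaStream08(Set("events"))

  // after moving to Kafka 0.9 the type (and therefore the source) must change,
  // even if every method the application calls stayed the same
  val after = new VersionedKafkaStream09(Set("events"))

  println(s"before=${before.topics}, after=${after.topics}")
}
{code}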

That's why one earlier idea I mentioned in this JIRA was 'The public API 
signatures (of KafkaUtils in the v0.9 subproject) are different and do not 
clash (with KafkaUtils in the original kafka subproject) and hence can be added 
to the existing (original kafka subproject) KafkaUtils class.' This also 
addresses the issues you mention above. Cody mentioned that we need to get 
others on the same page for this idea, so I guess we really need the committers 
to chime in here. Of course, I forgot to answer Nikita's follow-up question - 
'do you mean that we would change the original KafkaUtils by adding new 
functions for the new DirectIS/KafkaRDD but using them from a separate module 
with kafka09 classes'? To be clear, these new public methods added to the 
original kafka subproject's KafkaUtils will make use of the 
DirectKafkaInputDStream, KafkaRDD, KafkaRDDPartition and OffsetRange classes 
that live in a new v09 package (internal, of course). In short, we do not have 
a new subproject. (I skipped the KafkaCluster class from that list, because I 
am thinking it makes more sense to call that class something like 'KafkaClient' 
going forward.)
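
Here is a minimal, self-contained sketch of that non-clashing-overload idea 
(method names, parameters and return types are hypothetical stand-ins, not the 
actual patch): because the 0.8-style and 0.9-style entry points take different 
parameter lists, they can coexist on the same object without a versioned public 
package or a new subproject.

{code:scala}
// Toy sketch, all names hypothetical: two factory methods with different
// signatures living on one object, mirroring how new 0.9 overloads could be
// added to the existing KafkaUtils without clashing with the 0.8 ones.
object KafkaUtilsSketch {

  // stand-in for the existing 0.8-style entry point: explicit broker list
  // plus per-partition starting offsets
  def createDirectStream(
      kafkaParams: Map[String, String],
      fromOffsets: Map[(String, Int), Long]): String =
    s"0.8 path: ${fromOffsets.size} starting offsets via " +
      kafkaParams("metadata.broker.list")

  // stand-in for a new 0.9-style entry point: the new consumer API subscribes
  // to topics and manages offsets through the consumer itself
  def createDirectStream(
      kafkaParams: Map[String, String],
      topics: Set[String]): String =
    s"0.9 path: subscribing to ${topics.mkString(", ")} via " +
      kafkaParams("bootstrap.servers")
}

object KafkaUtilsSketchDemo extends App {
  println(KafkaUtilsSketch.createDirectStream(
    Map("metadata.broker.list" -> "broker08:9092"),
    Map(("events", 0) -> 0L)))

  println(KafkaUtilsSketch.createDirectStream(
    Map("bootstrap.servers" -> "broker09:9092"),
    Set("events")))
}
{code}

Either way, the user-facing entry point stays the existing KafkaUtils; which 
implementation runs is decided by the method signature, with the 0.9 classes 
kept in an internal v09 package.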



> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --------------------------------------------------
>
>                 Key: SPARK-12177
>                 URL: https://issues.apache.org/jira/browse/SPARK-12177
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.6.0
>            Reporter: Nikita Tarasenko
>              Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API 
> that is not compatible with the old one. So I added the new consumer API as 
> separate classes in the package org.apache.spark.streaming.kafka.v09, with 
> the changed API. I did not remove the old classes, for backward 
> compatibility: users will not need to change their existing Spark 
> applications when they upgrade to the new Spark version.
> Please review my changes.


