[ https://issues.apache.org/jira/browse/BEAM-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892746#comment-15892746 ]
peay edited comment on BEAM-1573 at 3/2/17 6:38 PM:
----------------------------------------------------

My concern is for both the source and the sink. I'd like to be able to use custom {{org.apache.kafka.common.serialization.Serializer}}/{{Deserializer}} implementations. An example is http://docs.confluent.io/2.0.0/schema-registry/docs/serializer-formatter.html#serializer, for setups where Kafka topics contain Avro records serialized against an Avro schema registry. That one is a {{Serializer}}/{{Deserializer<Object>}}, but I also have similar Kafka serializers with arbitrary types. The encoding/decoding methods in {{org.apache.kafka.common.serialization.Serializer}}/{{Deserializer}} are:
- {{byte[] serialize(String topic, T data)}}
- {{T deserialize(String topic, byte[] data)}}

I would like to be able to support a syntax like this:

{code}
KafkaIO.read()
    .withBootstrapServers(this.broker)
    .withTopics(ImmutableList.of(this.topic))
    .withCustomKafkaValueDeserializerAndCoder(new SomeCustomKafkaDeserializer(), AvroCoder.of(xxx))
    .withCustomKafkaKeyDeserializerAndCoder(new SomeCustomKafkaDeserializer(), AvroCoder.of(xxx))

KafkaIO.write()
    .withBootstrapServers(this.broker)
    .withTopic(this.topic)
    .withCustomKafkaValueSerializer(new SomeCustomKafkaSerializer())
    .withCustomKafkaKeySerializer(new SomeCustomKafkaSerializer())
{code}

In both cases, Kafka would use the custom serializer/deserializer directly.

Now, why is this hard to express currently? KafkaIO is implemented differently for reads and writes, so let us consider the two cases. I have a working patch for the above syntax that is straightforward for writes, but requires a number of changes for reads.

For writes, the {{Coder}} is wrapped into an actual {{org.apache.kafka.common.serialization.Serializer}} through {{CoderBasedKafkaSerializer}}. I can write a {{CustomCoder}} that wraps my Kafka serializer, but I still have to pass it the topic name manually. We also end up with a Kafka serializer wrapped in a {{Coder}}, itself wrapped in a Kafka serializer.

Reads are implemented differently, and I am not sure why. Instead of wrapping the coders into a Kafka deserializer, everything is hard-wired to a {{byte[]}} Kafka consumer, and KafkaIO calls the coder after the consumer has returned the data. Here too one can write a {{CustomCoder}}, but this won't work if a list of topics is given to KafkaIO, and it still requires passing in the topic name manually even when there is only one. In the example snippet above, I also include a second argument that is a coder, to use with {{setCoder}} when setting up the rest of the pipeline.

In both cases, wrapping the Kafka serializer into the {{Coder}} is also not obvious, because Kafka serializers have a {{configure}} method that gives them access to the consumer/producer config, so this possibly needs to be emulated in the coder wrapper.

What do you think? For concreteness, rough sketches of such a coder wrapper and its usage follow.
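Here is a minimal sketch of what such a coder wrapper could look like (an illustration only: the class name {{KafkaSerializerCoder}} is made up, the signatures follow the Beam 0.x-style {{Coder}} API, and instantiating the serializer classes reflectively is just one way to keep the coder serializable). Note how the topic and the consumer/producer config have to be passed in by hand, and how Kafka's {{configure}} call has to be emulated:

{code}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

public class KafkaSerializerCoder<T> extends CustomCoder<T> {
  private final String topic;                  // duplicates the topic already given to KafkaIO
  private final Map<String, Object> config;    // duplicates the consumer/producer config
  private final boolean isKey;
  private final Class<? extends Serializer<T>> serializerClass;
  private final Class<? extends Deserializer<T>> deserializerClass;

  private transient Serializer<T> serializer;
  private transient Deserializer<T> deserializer;

  public KafkaSerializerCoder(String topic, Map<String, Object> config, boolean isKey,
      Class<? extends Serializer<T>> serializerClass,
      Class<? extends Deserializer<T>> deserializerClass) {
    this.topic = topic;
    this.config = config;
    this.isKey = isKey;
    this.serializerClass = serializerClass;
    this.deserializerClass = deserializerClass;
  }

  private void ensureInitialized() throws IOException {
    if (serializer == null) {
      try {
        serializer = serializerClass.newInstance();
        deserializer = deserializerClass.newInstance();
      } catch (ReflectiveOperationException e) {
        throw new IOException(e);
      }
      // Kafka normally calls configure() on its own; here it has to be emulated.
      serializer.configure(config, isKey);
      deserializer.configure(config, isKey);
    }
  }

  @Override
  public void encode(T value, OutputStream out, Context context) throws IOException {
    ensureInitialized();
    byte[] bytes = serializer.serialize(topic, value);   // topic passed in manually
    new DataOutputStream(out).writeInt(bytes.length);    // length-prefix so decode() knows where to stop
    out.write(bytes);
  }

  @Override
  public T decode(InputStream in, Context context) throws IOException {
    ensureInitialized();
    DataInputStream din = new DataInputStream(in);
    byte[] bytes = new byte[din.readInt()];
    din.readFully(bytes);
    return deserializer.deserialize(topic, bytes);        // topic passed in manually again
  }
}
{code}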
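And roughly how it would be plugged into the current read API (again a sketch: the builder method names are approximate, {{broker}}, {{topic}} and {{consumerConfig}} are placeholders, and {{SomeCustomKafkaSerializer}}/{{SomeCustomKafkaDeserializer}} are the hypothetical registry-aware classes from the snippet above). The topic and config are repeated, and a multi-topic read cannot work at all, because the coder never learns which topic a given record came from:

{code}
// GenericRecord used only as an example element type.
KafkaSerializerCoder<GenericRecord> valueCoder =
    new KafkaSerializerCoder<>(topic, consumerConfig, false,
        SomeCustomKafkaSerializer.class, SomeCustomKafkaDeserializer.class);

pipeline.apply(KafkaIO.read()
    .withBootstrapServers(broker)
    .withTopics(ImmutableList.of(topic))   // topic duplicated between here and the coder
    .withValueCoder(valueCoder)
    .withoutMetadata());
{code}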
> KafkaIO does not allow using Kafka serializers and deserializers
> ----------------------------------------------------------------
>
>                 Key: BEAM-1573
>                 URL: https://issues.apache.org/jira/browse/BEAM-1573
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>    Affects Versions: 0.4.0, 0.5.0
>            Reporter: peay
>            Assignee: Raghu Angadi
>            Priority: Minor
>
> KafkaIO does not allow overriding the serializer and deserializer settings of the Kafka consumer and producer it uses internally. Instead, it only allows setting a `Coder`, and has a simple Kafka serializer/deserializer wrapper class that calls the coder.
> I appreciate that allowing the use of Beam coders is good and consistent with the rest of the system. However, is there a reason to completely disallow using custom Kafka serializers instead?
> This is a limitation when working with an Avro schema registry, for instance, which requires custom serializers. One can write a `Coder` that wraps a custom Kafka serializer, but that means two levels of unnecessary wrapping.
> In addition, the `Coder` abstraction is not equivalent to Kafka's `Serializer`, which gets the topic name as input. Using a `Coder` wrapper would require duplicating the output topic setting both in the argument to `KafkaIO` and when building the wrapper, which is inelegant and error-prone.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)