Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-24 Thread Jay Kreps
This is admittedly late in the release cycle to make a change. To add to
Jun's description, the motivation was that if that interface needed to
change, we felt it would be better to change it now rather than after the
release.

The motivation for wanting to make a change was to be able to develop real
support for Avro and other serialization formats. The current status is
pretty scattered--there is a schema repository on an Avro JIRA and another
fork of that on GitHub, and a bunch of people we have talked to have done
similar things for other serialization systems. It would be nice if these
things could be packaged in such a way that it was possible to just change
a few configs in the producer and get rich metadata support for messages.

As we were thinking this through, we realized that the new api we were about
to introduce was not very compatible with this, since it was just
byte[] oriented.

You can always do this by adding some kind of wrapper api that wraps the
producer. But this puts us back in the position of trying to document and
support multiple interfaces.

This also opens up the possibility of adding a MessageValidator or
MessageInterceptor plug-in transparently, so that you can do other custom
validation on the messages you are sending, which obviously requires access
to the original object, not the byte array.

This api doesn't prevent using byte[]: by configuring the
ByteArraySerializer, it works as it currently does.

-Jay

On Mon, Nov 24, 2014 at 5:58 PM, Jun Rao  wrote:

> Hi, Everyone,
>
> I'd like to start a discussion on whether it makes sense to add the
> serializer api back to the new java producer. Currently, the new java
> producer takes a byte array for both the key and the value. While this api
> is simple, it pushes the serialization logic into the application. This
> makes it hard to reason about what type of data is being sent to Kafka and
> also makes it hard to share an implementation of the serializer. For
> example, to support Avro, the serialization logic could be quite involved
> since it might need to register the Avro schema in some remote registry and
> maintain a schema cache locally, etc. Without a serialization api, it's
> impossible to share such an implementation so that people can easily reuse it.
> We sort of overlooked this implication during the initial discussion of the
> producer api.
>
> So, I'd like to propose an api change to the new producer by adding back
> the serializer api similar to what we had in the old producer. Specifically,
> the proposed api changes are the following.
>
> First, we change KafkaProducer to take generic types K and V for the key
> and the value, respectively.
>
> public class KafkaProducer<K,V> implements Producer<K,V> {
>
>     public Future<RecordMetadata> send(ProducerRecord<K,V> record, Callback callback);
>
>     public Future<RecordMetadata> send(ProducerRecord<K,V> record);
> }
>
> Second, we add two new configs, one for the key serializer and another for
> the value serializer. Both serializers will default to the byte array
> implementation.
>
> public class ProducerConfig extends AbstractConfig {
>
>     .define(KEY_SERIALIZER_CLASS_CONFIG, Type.CLASS,
>             "org.apache.kafka.clients.producer.ByteArraySerializer",
>             Importance.HIGH, KEY_SERIALIZER_CLASS_DOC)
>     .define(VALUE_SERIALIZER_CLASS_CONFIG, Type.CLASS,
>             "org.apache.kafka.clients.producer.ByteArraySerializer",
>             Importance.HIGH, VALUE_SERIALIZER_CLASS_DOC);
> }
>
> Both serializers will implement the following interface.
>
> public interface Serializer<T> extends Configurable {
>     public byte[] serialize(String topic, T data, boolean isKey);
>
>     public void close();
> }
>
> This is more or less the same as what's in the old producer. The slight
> differences are (1) the serializer now only requires a parameter-less
> constructor; (2) the serializer has a configure() and a close() method for
> initialization and cleanup, respectively; (3) the serialize() method
> additionally takes the topic and an isKey indicator, both of which are
> useful for things like schema registration.
>
> The detailed changes are included in KAFKA-1797. For completeness, I also
> made the corresponding changes for the new java consumer api as well.
>
> Note that the proposed api changes are incompatible with what's in the
> 0.8.2 branch. However, if those api changes are beneficial, it's probably
> better to include them now in the 0.8.2 release, rather than later.
>
> I'd like to discuss mainly two things in this thread.
> 1. Do people feel that the proposed api changes are reasonable?
> 2. Are there any concerns about including the api changes in the 0.8.2 final
> release?
>
> Thanks,
>
> Jun
>
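To make the proposal above concrete, here is a minimal sketch of what a
serializer written against the proposed interface might look like. The
Configurable and Serializer types below are local stand-ins that mirror the
interface quoted in Jun's message so that the sketch compiles without the
Kafka client jar; they are not the real classes. StringSerializer and the
"serializer.encoding" config key are assumptions made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Local stand-ins mirroring the interfaces proposed in the thread.
interface Configurable {
    void configure(Map<String, ?> configs);
}

interface Serializer<T> extends Configurable {
    byte[] serialize(String topic, T data, boolean isKey);
    void close();
}

// Hypothetical serializer: encodes strings with a configurable charset.
class StringSerializer implements Serializer<String> {
    private Charset charset = StandardCharsets.UTF_8;

    // Parameter-less constructor, as the proposal requires.
    public StringSerializer() {}

    @Override
    public void configure(Map<String, ?> configs) {
        // "serializer.encoding" is an assumed key, not a documented config.
        Object name = configs.get("serializer.encoding");
        if (name != null) {
            charset = Charset.forName(name.toString());
        }
    }

    @Override
    public byte[] serialize(String topic, String data, boolean isKey) {
        // topic and isKey are unused here; a schema-aware serializer could
        // use them to decide which schema to register or look up.
        return data == null ? null : data.getBytes(charset);
    }

    @Override
    public void close() {}
}

// Pass-through serializer, matching the proposed default behaviour.
class ByteArraySerializer implements Serializer<byte[]> {
    @Override public void configure(Map<String, ?> configs) {}
    @Override public byte[] serialize(String topic, byte[] data, boolean isKey) { return data; }
    @Override public void close() {}
}

class SerializerSketch {
    public static void main(String[] args) {
        Map<String, Object> configs = new HashMap<>();
        configs.put("serializer.encoding", "UTF-8");

        StringSerializer serializer = new StringSerializer();
        serializer.configure(configs);
        byte[] value = serializer.serialize("events", "hello", false);
        System.out.println("serialized " + value.length + " bytes");
        serializer.close();
    }
}
```

With the pass-through implementation as the default, an application that
already hands byte[] to the producer would keep working the same way, which
is the point Jay makes about ByteArraySerializer.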


Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-24 Thread Gwen Shapira
As one of the people who spent too much time building Avro repositories, +1
on bringing serializer API back.

I think it will make the new producer easier to work with.

Gwen


Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-24 Thread Sriram Subramanian
Looked at the patch. +1 from me.


Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Shlomi Hazan
Jun, while I am just a humble user, I would like to recall that it was just 6
days ago that you told me on the user list that the producer is stable, when
I asked which producer to go with and whether the new producer is production
stable (you can still see that email down the list).
Maybe I am missing something, but for me, stable includes the API.
So from where I am standing, this change looks rather too big and too late
to make now. This kind of change will introduce generics, add a major
mandatory interface, and make the whole producer more complicated than it
really has to be when you consider only Kafka and not Avro.
I can see the obvious benefits for the many other use cases, but once you
declare something stable it is usually expected that the API will not
change unless something really big was discovered.
Now it may be the case that you discovered something big enough, and so
personally I will not cast a vote.
Whether the benefits make the change justifiable is for you guys to decide.
Shlomi


Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Jonathan Weeks
+1 on this change — APIs are forever. As much as we’d love to see 0.8.2 release 
ASAP, it is important to get this right.

-JW




Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Joe Stein
The serializer is an expected use of the producer/consumer now, and I think we
should continue that support in the new client. As far as breaking the API
goes, this is why we released 0.8.2-beta: to help get through just these types
of blocking issues in a way that lets the community at large get involved more
easily, with a build and binaries to download and use from Maven.

+1 on the change now prior to the 0.8.2 release.

- Joe Stein



Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Manikumar Reddy
+1 for this change.

What about the deserializer class in 0.8.2? Say I am using the new producer
with Avro in combination with the old consumer. Then I need to provide a
custom Decoder implementation for Avro, right?
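
As a rough illustration of the combination asked about here, the sketch
below keeps the wire format in one class that implements both a
producer-side serializer and a consumer-side decoder. Serializer and Decoder
are local stand-ins (the latter modelled on the old consumer's
kafka.serializer.Decoder, which exposes fromBytes); UserEvent, UserEventCodec
and the simple DataOutputStream encoding are hypothetical and stand in for a
real Avro encoding. Wiring this into the actual clients is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the proposed new-producer interface (configure() omitted).
interface Serializer<T> {
    byte[] serialize(String topic, T data, boolean isKey);
    void close();
}

// Stand-in modelled on the old consumer's kafka.serializer.Decoder.
interface Decoder<T> {
    T fromBytes(byte[] bytes);
}

// Hypothetical payload type.
final class UserEvent {
    final int id;
    final String name;

    UserEvent(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Implementing both sides in one class keeps the byte layout in one place.
class UserEventCodec implements Serializer<UserEvent>, Decoder<UserEvent> {
    @Override
    public byte[] serialize(String topic, UserEvent data, boolean isKey) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(data.id);    // 4 bytes, big-endian
            out.writeUTF(data.name);  // 2-byte length prefix + modified UTF-8
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public UserEvent fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            // Fields must be read in the same order they were written.
            return new UserEvent(in.readInt(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {}
}

class CodecSketch {
    public static void main(String[] args) {
        UserEventCodec codec = new UserEventCodec();
        byte[] bytes = codec.serialize("users", new UserEvent(7, "ada"), false);
        UserEvent decoded = codec.fromBytes(bytes);
        System.out.println(decoded.id + " " + decoded.name);
    }
}
```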

On Tue, Nov 25, 2014 at 9:19 PM, Joe Stein  wrote:

> The serializer is an expected use of the producer/consumer now and think we
> should continue that support in the new client. As far as breaking the API
> it is why we released the 0.8.2-beta to help get through just these type of
> blocking issues in a way that the community at large could be involved in
> easier with a build/binaries to download and use from maven also.
>
> +1 on the change now prior to the 0.8.2 release.
>
> - Joe Stein
>
>
> On Mon, Nov 24, 2014 at 11:43 PM, Sriram Subramanian <
> srsubraman...@linkedin.com.invalid> wrote:
>
> > Looked at the patch. +1 from me.
> >
> > On 11/24/14 8:29 PM, "Gwen Shapira"  wrote:
> >
> > >As one of the people who spent too much time building Avro repositories,
> > >+1
> > >on bringing serializer API back.
> > >
> > >I think it will make the new producer easier to work with.
> > >
> > >Gwen
> > >
> > >On Mon, Nov 24, 2014 at 6:13 PM, Jay Kreps  wrote:
> > >
> > >> This is admittedly late in the release cycle to make a change. To add
> to
> > >> Jun's description the motivation was that we felt it would be better
> to
> > >> change that interface now rather than after the release if it needed
> to
> > >> change.
> > >>
> > >> The motivation for wanting to make a change was the ability to really
> be
> > >> able to develop support for Avro and other serialization formats. The
> > >> current status is pretty scattered--there is a schema repository on an
> > >>Avro
> > >> JIRA and another fork of that on github, and a bunch of people we have
> > >> talked to have done similar things for other serialization systems. It
> > >> would be nice if these things could be packaged in such a way that it
> > >>was
> > >> possible to just change a few configs in the producer and get rich
> > >>metadata
> > >> support for messages.
> > >>
> > >> As we were thinking this through we realized that the new api we were
> > >>about
> > >> to introduce was kind of not very compatable with this since it was
> just
> > >> byte[] oriented.
> > >>
> > >> You can always do this by adding some kind of wrapper api that wraps
> the
> > >> producer. But this puts us back in the position of trying to document
> > >>and
> > >> support multiple interfaces.
> > >>
> > >> This also opens up the possibility of adding a MessageValidator or
> > >> MessageInterceptor plug-in transparently so that you can do other
> custom
> > >> validation on the messages you are sending which obviously requires
> > >>access
> > >> to the original object not the byte array.
> > >>
> > >> This api doesn't prevent using byte[] by configuring the
> > >> ByteArraySerializer it works as it currently does.
> > >>
> > >> -Jay
> > >>
> > >> On Mon, Nov 24, 2014 at 5:58 PM, Jun Rao  wrote:
> > >>
> > >> > Hi, Everyone,
> > >> >
> > >> > I'd like to start a discussion on whether it makes sense to add the
> > >> > serializer api back to the new java producer. Currently, the new
> java
> > >> > producer takes a byte array for both the key and the value. While
> this
> > >> api
> > >> > is simple, it pushes the serialization logic into the application.
> > >>This
> > >> > makes it hard to reason about what type of data is being sent to
> Kafka
> > >> and
> > >> > also makes it hard to share an implementation of the serializer. For
> > >> > example, to support Avro, the serialization logic could be quite
> > >>involved
> > >> > since it might need to register the Avro schema in some remote
> > >>registry
> > >> and
> > >> > maintain a schema cache locally, etc. Without a serialization api,
> > >>it's
> > >> > impossible to share such an implementation so that people can easily
> > >> reuse.
> > >> > We sort of overlooked this implication during the initial discussion
> > >>of
> > >> the
> > >> > producer api.
> > >> >
> > >> > So, I'd like to propose an api change to the new producer by adding
> > >>back
> > >> > the serializer api similar to what we had in the old producer.
> > >> > Specifically, the proposed api changes are the following.
> > >> >
> > >> > First, we change KafkaProducer to take generic types K and V for the
> > >> > key and the value, respectively.
> > >> >
> > >> > public class KafkaProducer<K,V> implements Producer<K,V> {
> > >> >
> > >> >     public Future<RecordMetadata> send(ProducerRecord<K,V> record,
> > >> >                                        Callback callback);
> > >> >
> > >> >     public Future<RecordMetadata> send(ProducerRecord<K,V> record);
> > >> > }
> > >> >
> > >> > Second, we add two new configs, one for the key serializer and
> another
> > >> for
> > >> > the value serializer. Both serializers will default to the byte
> array
> > >> > implementation.
> > >> >
> > >> > public class ProducerConfig extends AbstractConfig {
> > >> >
> > >> > .define(KEY_SERIALIZER_CLASS_CONFIG, Type.CLASS,
> > >> > "org.apach

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Bhavesh Mistry
How will a mixed bag work on the consumer side? An entire site cannot be
rolled at once, so will the consumer have to deal with both new and old
serialized bytes? This could be the app team's responsibility. Are you guys
targeting the 0.8.2 release? That may break customers who are already using
the new producer API (beta version).

Thanks,

Bhavesh

On Tue, Nov 25, 2014 at 8:29 AM, Manikumar Reddy 
wrote:

> +1 for this change.
>
> what about de-serializer  class in 0.8.2?  Say i am using new producer with
> Avro and old consumer combination.
> then i need to give custom Decoder implementation for Avro right?.
>
> On Tue, Nov 25, 2014 at 9:19 PM, Joe Stein  wrote:
>
> > The serializer is an expected use of the producer/consumer now and think
> we
> > should continue that support in the new client. As far as breaking the
> API
> > it is why we released the 0.8.2-beta to help get through just these type
> of
> > blocking issues in a way that the community at large could be involved in
> > easier with a build/binaries to download and use from maven also.
> >
> > +1 on the change now prior to the 0.8.2 release.
> >
> > - Joe Stein
> >
> >
> > On Mon, Nov 24, 2014 at 11:43 PM, Sriram Subramanian <
> > srsubraman...@linkedin.com.invalid> wrote:
> >
> > > Looked at the patch. +1 from me.
> > >
> > > On 11/24/14 8:29 PM, "Gwen Shapira"  wrote:
> > >
> > > >As one of the people who spent too much time building Avro
> repositories,
> > > >+1
> > > >on bringing serializer API back.
> > > >
> > > >I think it will make the new producer easier to work with.
> > > >
> > > >Gwen
> > > >

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Jay Kreps
Hey Shlomi,

I agree that we just blew this one from a timing perspective. We ideally
should have thought this through in the original api discussion. But as we
really started to think about this area we realized that the existing api
made it really hard to provide a simple way to package serialization and
data model stuff. I have heard that there is a saying that the best time to
plant a tree is a generation ago, but the second best time is right now.
And I think this is kind of along those lines.

This was really brought home to me as for the last month or so I have been
going around and talking to a lot of people using Kafka and essentially
every one of them has had to make some kind of wrapper api. There is
nothing so terrible about these wrappers except that they make it hard to
have central documentation that explains how the system works, and they
usually strip off a lot of the functionality of the client, so you always
have to learn the in-house wrapper and can't really do everything you could
do with the main client. All the wrappers were trying to provide the same
few things: serialization, message validation, etc., and all of these depend
on having access to the original object. I think if we make this change on
serialization we can later add any additional hooks for message validation
with no compatibility problems.
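
To illustrate why a validation hook needs the original object, here is a
minimal, self-contained sketch. Nothing in it comes from the patch:
`MessageValidator`, `ValueSerializer` and `ValidatingProducer` are
hypothetical names, and the toy producer only records bytes in a list
instead of sending them anywhere.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook: the thread only names the idea, not an API.
interface MessageValidator<V> {
    void validate(String topic, V value); // throws IllegalArgumentException on bad input
}

// Hypothetical, simplified serializer (no key handling, no configure/close).
interface ValueSerializer<V> {
    byte[] serialize(String topic, V value);
}

// Toy producer that records serialized values instead of sending them.
class ValidatingProducer<V> {
    final List<byte[]> sent = new ArrayList<>();
    private final MessageValidator<V> validator;
    private final ValueSerializer<V> serializer;

    ValidatingProducer(MessageValidator<V> validator, ValueSerializer<V> serializer) {
        this.validator = validator;
        this.serializer = serializer;
    }

    void send(String topic, V value) {
        validator.validate(topic, value);             // sees the original object
        sent.add(serializer.serialize(topic, value)); // bytes exist only after this call
    }
}

class ValidatorExample {
    public static void main(String[] args) {
        ValidatingProducer<String> producer = new ValidatingProducer<String>(
            (topic, value) -> {
                if (value == null || value.isEmpty()) {
                    throw new IllegalArgumentException("empty message for topic " + topic);
                }
            },
            (topic, value) -> value.getBytes(StandardCharsets.UTF_8));

        producer.send("events", "ok");
        try {
            producer.send("events", "");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("sent count: " + producer.sent.size());
    }
}
```

With a byte[]-only producer the same check would have to run in the
application or in a wrapper, before the object is turned into bytes.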

-Jay

On Tue, Nov 25, 2014 at 12:12 AM, Shlomi Hazan  wrote:

> Jun, while just a humble user, I would like to recall that it was just 6
> days ago that you told me on the user list that the producer is stable when
> I asked what producer to go with and if the new producer is production
> stable (you can still see that email down the list).
> maybe I miss something, but for me, stable includes the API.
> So it looks rather too big and too late from where I am standing to make
> this change now. this kind of change will introduce generics, add major
> mandatory interface, and make the whole producer more complicated then it
> really has to be when you consider only Kafka and not Avro.
> I can see the obvious benefits for the many other use cases, but once you
> declare something stable it is usually expected that the API will not
> change unless something really big was discovered.
> Now it may be the case that you discovered something big enough and so
> personally I will not make a vote.
> If the benefits make the change justifiable is for you guys to decide.
> Shlomi
>
> On Tue, Nov 25, 2014 at 6:43 AM, Sriram Subramanian <
> srsubraman...@linkedin.com.invalid> wrote:
>
> > Looked at the patch. +1 from me.
> >
> > On 11/24/14 8:29 PM, "Gwen Shapira"  wrote:
> >
> > >As one of the people who spent too much time building Avro repositories,
> > >+1
> > >on bringing serializer API back.
> > >
> > >I think it will make the new producer easier to work with.
> > >
> > >Gwen
> > >

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Jun Rao
Shlomi,

Sorry, at that time, I didn't realize that we would be better off with an
api change. Yes, it sucks that we have to break the api. However, if we
have to change it, it's better to do it now rather than later.

Note that if you want to just produce byte[] to Kafka, you can still do
that with the api change. You just need to parameterize the producer with
byte[], and the default serializer will just work. Yes, code changes are
needed. My hope is that right now no one has adopted the new producer api
widely and making such code changes is not very painful yet.
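
To make the byte[] case concrete, here is a minimal, self-contained sketch.
It does not use the real Kafka classes: `Serializer`, `PassThroughSerializer`
and `SketchProducer` are hypothetical stand-ins that mirror the shape of the
proposal, with configure() left out for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed interface (configure() omitted).
interface Serializer<T> {
    byte[] serialize(String topic, T data, boolean isKey);
    void close();
}

// Pass-through serializer: byte[] in, the same byte[] out.
class PassThroughSerializer implements Serializer<byte[]> {
    @Override
    public byte[] serialize(String topic, byte[] data, boolean isKey) {
        return data;
    }

    @Override
    public void close() { }
}

// Toy generic producer that records serialized values instead of sending them.
class SketchProducer<K, V> {
    final List<byte[]> sentValues = new ArrayList<>();
    private final Serializer<K> keySerializer;
    private final Serializer<V> valueSerializer;

    SketchProducer(Serializer<K> keySerializer, Serializer<V> valueSerializer) {
        this.keySerializer = keySerializer;
        this.valueSerializer = valueSerializer;
    }

    void send(String topic, K key, V value) {
        keySerializer.serialize(topic, key, true); // key bytes are not kept in this sketch
        sentValues.add(valueSerializer.serialize(topic, value, false));
    }
}

class ByteArrayProducerExample {
    public static void main(String[] args) {
        // Parameterizing with byte[] keeps the calling code close to the bytes-only api.
        SketchProducer<byte[], byte[]> producer = new SketchProducer<byte[], byte[]>(
            new PassThroughSerializer(), new PassThroughSerializer());

        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        producer.send("my-topic", null, payload);

        // The recorded value is the very same array that was passed in.
        System.out.println(producer.sentValues.get(0) == payload);
    }
}
```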

Thanks,

Jun

On Tue, Nov 25, 2014 at 12:12 AM, Shlomi Hazan  wrote:

> Jun, while just a humble user, I would like to recall that it was just 6
> days ago that you told me on the user list that the producer is stable when
> I asked what producer to go with and if the new producer is production
> stable (you can still see that email down the list).
> maybe I miss something, but for me, stable includes the API.
> So it looks rather too big and too late from where I am standing to make
> this change now. this kind of change will introduce generics, add major
> mandatory interface, and make the whole producer more complicated then it
> really has to be when you consider only Kafka and not Avro.
> I can see the obvious benefits for the many other use cases, but once you
> declare something stable it is usually expected that the API will not
> change unless something really big was discovered.
> Now it may be the case that you discovered something big enough and so
> personally I will not make a vote.
> If the benefits make the change justifiable is for you guys to decide.
> Shlomi
>
> On Tue, Nov 25, 2014 at 6:43 AM, Sriram Subramanian <
> srsubraman...@linkedin.com.invalid> wrote:
>
> > Looked at the patch. +1 from me.
> >
> > On 11/24/14 8:29 PM, "Gwen Shapira"  wrote:
> >
> > >As one of the people who spent too much time building Avro repositories,
> > >+1
> > >on bringing serializer API back.
> > >
> > >I think it will make the new producer easier to work with.
> > >
> > >Gwen
> > >

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-11-25 Thread Jun Rao
The old consumer already takes a deserializer when creating streams. So you
plug in your decoder there.
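
A rough sketch of what such a decoder could look like follows. To keep it
runnable without Kafka on the classpath, `Decoder` here is a local stand-in
shaped like the old consumer's decoder hook, and `UserEvent` with its
"id|name" wire format is a made-up example rather than Avro.

```java
import java.nio.charset.StandardCharsets;

// Local stand-in for the old consumer's decoder hook; not the real Kafka type.
interface Decoder<T> {
    T fromBytes(byte[] bytes);
}

// Hypothetical record type used only for this illustration.
class UserEvent {
    final int id;
    final String name;

    UserEvent(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Decodes the made-up "id|name" UTF-8 format back into a UserEvent.
class UserEventDecoder implements Decoder<UserEvent> {
    @Override
    public UserEvent fromBytes(byte[] bytes) {
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split("\\|", 2);
        return new UserEvent(Integer.parseInt(parts[0]), parts[1]);
    }
}

class DecoderExample {
    public static void main(String[] args) {
        byte[] wire = "7|alice".getBytes(StandardCharsets.UTF_8);
        UserEvent event = new UserEventDecoder().fromBytes(wire);
        System.out.println(event.id + " " + event.name);
    }
}
```

An Avro decoder would follow the same shape, with the body of fromBytes()
replaced by whatever mirrors the producer-side serializer.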

Thanks,

Jun

On Tue, Nov 25, 2014 at 8:29 AM, Manikumar Reddy 
wrote:

> +1 for this change.
>
> what about de-serializer  class in 0.8.2?  Say i am using new producer with
> Avro and old consumer combination.
> then i need to give custom Decoder implementation for Avro right?.
>
> On Tue, Nov 25, 2014 at 9:19 PM, Joe Stein  wrote:
>
> > The serializer is an expected use of the producer/consumer now and think
> we
> > should continue that support in the new client. As far as breaking the
> API
> > it is why we released the 0.8.2-beta to help get through just these type
> of
> > blocking issues in a way that the community at large could be involved in
> > easier with a build/binaries to download and use from maven also.
> >
> > +1 on the change now prior to the 0.8.2 release.
> >
> > - Joe Stein
> >
> >
> > On Mon, Nov 24, 2014 at 11:43 PM, Sriram Subramanian <
> > srsubraman...@linkedin.com.invalid> wrote:
> >
> > > Looked at the patch. +1 from me.
> > >
> > > On 11/24/14 8:29 PM, "Gwen Shapira"  wrote:
> > >
> > > >As one of the people who spent too much time building Avro
> repositories,
> > > >+1
> > > >on bringing serializer API back.
> > > >
> > > >I think it will make the new producer easier to work with.
> > > >
> > > >Gwen
> > > >

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Joel Koshy
> makes it hard to reason about what type of data is being sent to Kafka and
> also makes it hard to share an implementation of the serializer. For
> example, to support Avro, the serialization logic could be quite involved
> since it might need to register the Avro schema in some remote registry and
> maintain a schema cache locally, etc. Without a serialization api, it's
> impossible to share such an implementation so that people can easily reuse.
> We sort of overlooked this implication during the initial discussion of the
> producer api.

Thanks for bringing this up and the patch.  My take on this is that
any reasoning about the data itself is more appropriately handled
outside of the core producer API. FWIW, I don't think this was
_overlooked_ during the initial discussion of the producer API
(especially since it was a significant change from the old producer).
IIRC we believed at the time that there is elegance and flexibility in
a simple API that deals with raw bytes. I think it is more accurate to
say that this is a reversal of opinion for some (which is fine) but
personally I'm still in the old camp :) i.e., I really like the
simplicity of the current 0.8.2 producer API and find parameterized
types/generics to be distracting and annoying; and IMO any
data-specific handling is better absorbed at a higher level than the
core Kafka APIs - possibly by a (very thin) wrapper producer library.
I don't quite see why it is difficult to share different wrapper
implementations; or even ser-de libraries for that matter that people
can invoke before sending to/reading from Kafka.
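
For comparison, the thin-wrapper alternative might look roughly like the
sketch below. `RawBytesProducer` is a hypothetical stand-in for a
bytes-only producer (it just records what it is given), and
`TypedProducerWrapper` is an illustrative name, not an existing library.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a producer that only accepts raw bytes.
class RawBytesProducer {
    final List<byte[]> sent = new ArrayList<>();

    void send(String topic, byte[] key, byte[] value) {
        sent.add(value); // records the value instead of sending it
    }
}

// Thin wrapper: converts objects to bytes, then delegates to the raw producer.
class TypedProducerWrapper<V> {
    private final RawBytesProducer delegate;
    private final Function<V, byte[]> toBytes;

    TypedProducerWrapper(RawBytesProducer delegate, Function<V, byte[]> toBytes) {
        this.delegate = delegate;
        this.toBytes = toBytes;
    }

    void send(String topic, V value) {
        delegate.send(topic, null, toBytes.apply(value));
    }
}

class WrapperExample {
    public static void main(String[] args) {
        RawBytesProducer raw = new RawBytesProducer();
        TypedProducerWrapper<String> producer = new TypedProducerWrapper<String>(
            raw, s -> s.getBytes(StandardCharsets.UTF_8));

        producer.send("greetings", "hello");
        System.out.println(raw.sent.get(0).length + " bytes handed to the raw producer");
    }
}
```

In this arrangement the core producer stays bytes-only, and the serialization
choice lives entirely in the wrapper that each team writes or shares.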

That said I'm not opposed to the change - it's just that I prefer
what's currently there. So I'm +0 on the proposal.

Thanks,

Joel

On Mon, Nov 24, 2014 at 05:58:50PM -0800, Jun Rao wrote:
> Hi, Everyone,
> 
> I'd like to start a discussion on whether it makes sense to add the
> serializer api back to the new java producer. Currently, the new java
> producer takes a byte array for both the key and the value. While this api
> is simple, it pushes the serialization logic into the application. This
> makes it hard to reason about what type of data is being sent to Kafka and
> also makes it hard to share an implementation of the serializer. For
> example, to support Avro, the serialization logic could be quite involved
> since it might need to register the Avro schema in some remote registry and
> maintain a schema cache locally, etc. Without a serialization api, it's
> impossible to share such an implementation so that people can easily reuse.
> We sort of overlooked this implication during the initial discussion of the
> producer api.
> 
> So, I'd like to propose an api change to the new producer by adding back
> the serializer api similar to what we had in the old producer. Specifically,
> the proposed api changes are the following.
> 
> First, we change KafkaProducer to take generic types K and V for the key
> and the value, respectively.
> 
> public class KafkaProducer<K,V> implements Producer<K,V> {
> 
>     public Future<RecordMetadata> send(ProducerRecord<K,V> record,
>                                        Callback callback);
> 
>     public Future<RecordMetadata> send(ProducerRecord<K,V> record);
> }
> 
> Second, we add two new configs, one for the key serializer and another for
> the value serializer. Both serializers will default to the byte array
> implementation.
> 
> public class ProducerConfig extends AbstractConfig {
> 
> .define(KEY_SERIALIZER_CLASS_CONFIG, Type.CLASS,
> "org.apache.kafka.clients.producer.ByteArraySerializer", Importance.HIGH,
> KEY_SERIALIZER_CLASS_DOC)
> .define(VALUE_SERIALIZER_CLASS_CONFIG, Type.CLASS,
> "org.apache.kafka.clients.producer.ByteArraySerializer", Importance.HIGH,
> VALUE_SERIALIZER_CLASS_DOC);
> }
> 
> Both serializers will implement the following interface.
> 
> public interface Serializer<T> extends Configurable {
>     public byte[] serialize(String topic, T data, boolean isKey);
> 
>     public void close();
> }
> 
> This is more or less the same as what's in the old producer. The slight
> differences are (1) the serializer now only requires a parameter-less
> constructor; (2) the serializer has a configure() and a close() method for
> initialization and cleanup, respectively; (3) the serialize() method
> additionally takes the topic and an isKey indicator, both of which are
> useful for things like schema registration.
> 
> The detailed changes are included in KAFKA-1797. For completeness, I also
> made the corresponding changes for the new java consumer api as well.
> 
> Note that the proposed api changes are incompatible with what's in the
> 0.8.2 branch. However, if those api changes are beneficial, it's probably
> better to include them now in the 0.8.2 release, rather than later.
> 
> I'd like to discuss mainly two things in this thread.
> 1. Do people feel that the proposed api changes are reasonable?
> 2. Are there any concerns of including the api changes in the 0.8.2 final
> release?
> 
> Thanks,
> 
> Jun
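
As a sketch of how the quoted interface could be used for schema
registration with a local cache, consider the following. It is
self-contained rather than built against Kafka: `Configurable` and
`Serializer` are local stand-ins for the quoted types, `InMemoryRegistry`
stands in for a remote registry, and the "sketch.registry" config key and
the "id:payload" output format are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Local stand-ins mirroring the quoted proposal; not the real Kafka types.
interface Configurable {
    void configure(Map<String, ?> configs);
}

interface Serializer<T> extends Configurable {
    byte[] serialize(String topic, T data, boolean isKey);
    void close();
}

// Stand-in for a remote schema registry; counts calls so caching is visible.
class InMemoryRegistry {
    int registrations = 0;
    private final Map<String, Integer> ids = new HashMap<>();

    int register(String subject) {
        registrations++;
        Integer id = ids.get(subject);
        if (id == null) {
            id = ids.size() + 1;
            ids.put(subject, id);
        }
        return id;
    }
}

class SchemaTaggingSerializer implements Serializer<String> {
    private InMemoryRegistry registry;
    private final Map<String, Integer> schemaIdCache = new HashMap<>();

    // Parameter-less constructor; all setup happens in configure().
    public SchemaTaggingSerializer() { }

    @Override
    public void configure(Map<String, ?> configs) {
        // A real serializer would likely read a registry URL; this key is made up.
        registry = (InMemoryRegistry) configs.get("sketch.registry");
    }

    @Override
    public byte[] serialize(String topic, String data, boolean isKey) {
        // topic and isKey together identify which schema applies.
        String subject = topic + (isKey ? "-key" : "-value");
        Integer id = schemaIdCache.get(subject);
        if (id == null) {
            id = registry.register(subject); // only on a cache miss
            schemaIdCache.put(subject, id);
        }
        return (id + ":" + data).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void close() {
        schemaIdCache.clear();
    }
}

class SchemaSerializerExample {
    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        Map<String, Object> configs = new HashMap<>();
        configs.put("sketch.registry", registry);

        SchemaTaggingSerializer serializer = new SchemaTaggingSerializer();
        serializer.configure(configs);
        serializer.serialize("orders", "a", false);
        serializer.serialize("orders", "b", false); // served from the local cache
        System.out.println("registrations: " + registry.registrations);
        serializer.close();
    }
}
```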



Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Jun Rao
Joel,

Thanks for the feedback.

Yes, the raw bytes interface is simpler than the generic api. However, it
just pushes the complexity of dealing with the objects to the application.
We also thought about the layered approach. However, this may confuse the
users since there is no single entry point and it's not clear which layer a
user should be using.

Jun


On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  wrote:

> > makes it hard to reason about what type of data is being sent to Kafka
> and
> > also makes it hard to share an implementation of the serializer. For
> > example, to support Avro, the serialization logic could be quite involved
> > since it might need to register the Avro schema in some remote registry
> and
> > maintain a schema cache locally, etc. Without a serialization api, it's
> > impossible to share such an implementation so that people can easily
> reuse.
> > We sort of overlooked this implication during the initial discussion of
> the
> > producer api.
>
> Thanks for bringing this up and the patch.  My take on this is that
> any reasoning about the data itself is more appropriately handled
> outside of the core producer API. FWIW, I don't think this was
> _overlooked_ during the initial discussion of the producer API
> (especially since it was a significant change from the old producer).
> IIRC we believed at the time that there is elegance and flexibility in
> a simple API that deals with raw bytes. I think it is more accurate to
> say that this is a reversal of opinion for some (which is fine) but
> personally I'm still in the old camp :) i.e., I really like the
> simplicity of the current 0.8.2 producer API and find parameterized
> types/generics to be distracting and annoying; and IMO any
> data-specific handling is better absorbed at a higher-level than the
> core Kafka APIs - possibly by a (very thin) wrapper producer library.
> I don't quite see why it is difficult to share different wrapper
> implementations; or even ser-de libraries for that matter that people
> can invoke before sending to/reading from Kafka.
>
> That said I'm not opposed to the change - it's just that I prefer
> what's currently there. So I'm +0 on the proposal.
>
> Thanks,
>
> Joel
>
> On Mon, Nov 24, 2014 at 05:58:50PM -0800, Jun Rao wrote:
> > Hi, Everyone,
> >
> > I'd like to start a discussion on whether it makes sense to add the
> > serializer api back to the new java producer. Currently, the new java
> > producer takes a byte array for both the key and the value. While this
> api
> > is simple, it pushes the serialization logic into the application. This
> > makes it hard to reason about what type of data is being sent to Kafka
> and
> > also makes it hard to share an implementation of the serializer. For
> > example, to support Avro, the serialization logic could be quite involved
> > since it might need to register the Avro schema in some remote registry
> and
> > maintain a schema cache locally, etc. Without a serialization api, it's
> > impossible to share such an implementation so that people can easily
> reuse.
> > We sort of overlooked this implication during the initial discussion of
> the
> > producer api.
> >
> > So, I'd like to propose an api change to the new producer by adding back
> > the serializer api similar to what we had in the old producer. Specially,
> > the proposed api changes are the following.
> >
> > First, we change KafkaProducer to take generic types K and V for the key
> > and the value, respectively.
> >
> > public class KafkaProducer implements Producer {
> >
> > public Future send(ProducerRecord record,
> Callback
> > callback);
> >
> > public Future send(ProducerRecord record);
> > }
> >
> > Second, we add two new configs, one for the key serializer and another
> for
> > the value serializer. Both serializers will default to the byte array
> > implementation.
> >
> > public class ProducerConfig extends AbstractConfig {
> >
> > .define(KEY_SERIALIZER_CLASS_CONFIG, Type.CLASS,
> > "org.apache.kafka.clients.producer.ByteArraySerializer", Importance.HIGH,
> > KEY_SERIALIZER_CLASS_DOC)
> > .define(VALUE_SERIALIZER_CLASS_CONFIG, Type.CLASS,
> > "org.apache.kafka.clients.producer.ByteArraySerializer", Importance.HIGH,
> > VALUE_SERIALIZER_CLASS_DOC);
> > }
> >
> > Both serializers will implement the following interface.
> >
> > public interface Serializer extends Configurable {
> > public byte[] serialize(String topic, T data, boolean isKey);
> >
> > public void close();
> > }
> >
> > This is more or less the same as what's in the old producer. The slight
> > differences are (1) the serializer now only requires a parameter-less
> > constructor; (2) the serializer has a configure() and a close() method
> for
> > initialization and cleanup, respectively; (3) the serialize() method
> > additionally takes the topic and an isKey indicator, both of which are
> > useful for things like schema registration.
> >
> > The detailed changes are included in KAFKA-1797
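
For illustration only, a serializer written against the proposed interface
might look roughly like the sketch below. The Configurable and Serializer
interfaces are declared locally as stand-ins so the example compiles without
the Kafka client jar, and Utf8StringSerializer is a hypothetical class name,
not something taken from the KAFKA-1797 patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Local stand-ins mirroring the interfaces described in the proposal
// (assumption: the real ones would be provided by the client library).
interface Configurable {
    void configure(Map<String, ?> configs);
}

interface Serializer<T> extends Configurable {
    byte[] serialize(String topic, T data, boolean isKey);

    void close();
}

// Hypothetical implementation: encodes strings as UTF-8 bytes.
class Utf8StringSerializer implements Serializer<String> {

    @Override
    public void configure(Map<String, ?> configs) {
        // Nothing to set up here; a schema-aware serializer could read
        // its registry settings from the configs at this point.
    }

    @Override
    public byte[] serialize(String topic, String data, boolean isKey) {
        // topic and isKey are unused in this sketch; the proposal notes
        // they are useful for things like schema registration.
        return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void close() {
        // No resources to release.
    }
}
```

The class has only the implicit parameter-less constructor, which matches
point (1) of the proposal: all setup happens in configure().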

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Joel Koshy
Re: pushing complexity of dealing with objects: we're talking about
just a call to a serialize method to convert the object to a byte
array right? Or is there more to it? (To me) that seems less
cumbersome than having to interact with parameterized types. Actually,
can you explain more clearly what you mean by "reason about what
type of data is being sent" in your original email? I have some
notion of what that means but it is a bit vague and you might have
meant something else.

Thanks,

Joel

On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> Joel,
> 
> Thanks for the feedback.
> 
> Yes, the raw bytes interface is simpler than the Generic api. However, it
> just pushes the complexity of dealing with the objects to the application.
> We also thought about the layered approach. However, this may confuse the
> users since there is no single entry point and it's not clear which layer a
> user should be using.
> 
> Jun
> 
> 
> On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  wrote:
> 
> > > makes it hard to reason about what type of data is being sent to Kafka
> > and
> > > also makes it hard to share an implementation of the serializer. For
> > > example, to support Avro, the serialization logic could be quite involved
> > > since it might need to register the Avro schema in some remote registry
> > and
> > > maintain a schema cache locally, etc. Without a serialization api, it's
> > > impossible to share such an implementation so that people can easily
> > reuse.
> > > We sort of overlooked this implication during the initial discussion of
> > the
> > > producer api.
> >
> > Thanks for bringing this up and the patch.  My take on this is that
> > any reasoning about the data itself is more appropriately handled
> > outside of the core producer API. FWIW, I don't think this was
> > _overlooked_ during the initial discussion of the producer API
> > (especially since it was a significant change from the old producer).
> > IIRC we believed at the time that there is elegance and flexibility in
> > a simple API that deals with raw bytes. I think it is more accurate to
> > say that this is a reversal of opinion for some (which is fine) but
> > personally I'm still in the old camp :) i.e., I really like the
> > simplicity of the current 0.8.2 producer API and find parameterized
> > types/generics to be distracting and annoying; and IMO any
> > data-specific handling is better absorbed at a higher-level than the
> > core Kafka APIs - possibly by a (very thin) wrapper producer library.
> > I don't quite see why it is difficult to share different wrapper
> > implementations; or even ser-de libraries for that matter that people
> > can invoke before sending to/reading from Kafka.
> >
> > That said I'm not opposed to the change - it's just that I prefer
> > what's currently there. So I'm +0 on the proposal.
> >
> > Thanks,
> >
> > Joel
> >
> > On Mon, Nov 24, 2014 at 05:58:50PM -0800, Jun Rao wrote:
> > > Hi, Everyone,
> > >
> > > I'd like to start a discussion on whether it makes sense to add the
> > > serializer api back to the new java producer. Currently, the new java
> > > producer takes a byte array for both the key and the value. While this
> > api
> > > is simple, it pushes the serialization logic into the application. This
> > > makes it hard to reason about what type of data is being sent to Kafka
> > and
> > > also makes it hard to share an implementation of the serializer. For
> > > example, to support Avro, the serialization logic could be quite involved
> > > since it might need to register the Avro schema in some remote registry
> > and
> > > maintain a schema cache locally, etc. Without a serialization api, it's
> > > impossible to share such an implementation so that people can easily
> > reuse.
> > > We sort of overlooked this implication during the initial discussion of
> > the
> > > producer api.
> > >
> > > So, I'd like to propose an api change to the new producer by adding back
> > > > the serializer api similar to what we had in the old producer. Specifically,
> > > the proposed api changes are the following.
> > >
> > > First, we change KafkaProducer to take generic types K and V for the key
> > > and the value, respectively.
> > >
> > > > public class KafkaProducer<K,V> implements Producer<K,V> {
> > > >
> > > >     public Future<RecordMetadata> send(ProducerRecord<K,V> record, Callback callback);
> > > >
> > > >     public Future<RecordMetadata> send(ProducerRecord<K,V> record);
> > > > }
> > >
> > > Second, we add two new configs, one for the key serializer and another
> > for
> > > the value serializer. Both serializers will default to the byte array
> > > implementation.
> > >
> > > public class ProducerConfig extends AbstractConfig {
> > >
> > > .define(KEY_SERIALIZER_CLASS_CONFIG, Type.CLASS,
> > > "org.apache.kafka.clients.producer.ByteArraySerializer", Importance.HIGH,
> > > KEY_SERIALIZER_CLASS_DOC)
> > > .define(VALUE_SERIALIZER_CLASS_CONFIG, Type.CLASS,
> > > "org.apache.kafka.clients.producer.ByteArraySerializer", Importance.HIGH,
>

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Jay Kreps
Hey Joel, you are right, we discussed this, but I think we didn't think
about it as deeply as we should have. I think our take was strongly shaped
by having a wrapper api at LinkedIn that DOES do the serialization
transparently so I think you are thinking of the producer as just an
implementation detail of that wrapper. Imagine a world where every
application at LinkedIn had to figure that part out themselves. That is,
imagine that what you guys supported was just the raw producer api and that
that just handled bytes. I think in that world the types of data you would
see would be totally funky and standardizing correct usage would be a
massive pain.

Conversely, you could imagine advocating the LinkedIn approach where you
just say, well, every org should wrap up the clients in a way that does
things like serialization and other data checks. The problem with that is
that (1) it is kind of redundant work and it is likely that the wrapper
will goof some nuances of the apis, and (2) it makes documentation and code
sharing really hard. That is, rather than being able to go to a central
place and read how to use the producer, LinkedIn people need to document
the LinkedIn producer wrapper, and users at LinkedIn need to read about
LinkedIn's wrapper for the producer to understand how to use it. Now
imagine this multiplied over every user.

The idea is that since everyone needs to do this we should just make it
easy to package up the best practice and plug it in. That way the
"contract" your application programs to is just the normal producer api.

-Jay

On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:

> Re: pushing complexity of dealing with objects: we're talking about
> just a call to a serialize method to convert the object to a byte
> array right? Or is there more to it? (To me) that seems less
> cumbersome than having to interact with parameterized types. Actually,
> can you explain more clearly what you mean by reason about what
> type of data is being sent in your original email? I have some
> notion of what that means but it is a bit vague and you might have
> meant something else.
>
> Thanks,
>
> Joel
>
> On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > Joel,
> >
> > Thanks for the feedback.
> >
> > Yes, the raw bytes interface is simpler than the Generic api. However, it
> > just pushes the complexity of dealing with the objects to the
> application.
> > We also thought about the layered approach. However, this may confuse the
> > users since there is no single entry point and it's not clear which
> layer a
> > user should be using.
> >
> > Jun
> >
> >
> > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  wrote:
> >
> > > > makes it hard to reason about what type of data is being sent to
> Kafka
> > > and
> > > > also makes it hard to share an implementation of the serializer. For
> > > > example, to support Avro, the serialization logic could be quite
> involved
> > > > since it might need to register the Avro schema in some remote
> registry
> > > and
> > > > maintain a schema cache locally, etc. Without a serialization api,
> it's
> > > > impossible to share such an implementation so that people can easily
> > > reuse.
> > > > We sort of overlooked this implication during the initial discussion
> of
> > > the
> > > > producer api.
> > >
> > > Thanks for bringing this up and the patch.  My take on this is that
> > > any reasoning about the data itself is more appropriately handled
> > > outside of the core producer API. FWIW, I don't think this was
> > > _overlooked_ during the initial discussion of the producer API
> > > (especially since it was a significant change from the old producer).
> > > IIRC we believed at the time that there is elegance and flexibility in
> > > a simple API that deals with raw bytes. I think it is more accurate to
> > > say that this is a reversal of opinion for some (which is fine) but
> > > personally I'm still in the old camp :) i.e., I really like the
> > > simplicity of the current 0.8.2 producer API and find parameterized
> > > types/generics to be distracting and annoying; and IMO any
> > > data-specific handling is better absorbed at a higher-level than the
> > > core Kafka APIs - possibly by a (very thin) wrapper producer library.
> > > I don't quite see why it is difficult to share different wrapper
> > > implementations; or even ser-de libraries for that matter that people
> > > can invoke before sending to/reading from Kafka.
> > >
> > > That said I'm not opposed to the change - it's just that I prefer
> > > what's currently there. So I'm +0 on the proposal.
> > >
> > > Thanks,
> > >
> > > Joel
> > >
> > > On Mon, Nov 24, 2014 at 05:58:50PM -0800, Jun Rao wrote:
> > > > Hi, Everyone,
> > > >
> > > > I'd like to start a discussion on whether it makes sense to add the
> > > > serializer api back to the new java producer. Currently, the new java
> > > > producer takes a byte array for both the key and the value. While
> this
> > > api
> > > > is simple

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Joel Koshy
Thanks for the follow-up Jay.  I still don't quite see the issue here
but maybe I just need to process this a bit more. To me "packaging up
the best practice and plug it in" seems to be to expose a simple
low-level API and give people the option to plug in a (possibly
shared) standard serializer in their application configs (or a custom
one if they choose) and invoke that from code. The additional
serialization call is a minor drawback but a very clear and easily
understood step that can be documented.  The serializer can obviously
also do other things such as schema registration. I'm actually not (or
at least I think I'm not) influenced very much by LinkedIn's wrapper.
It's just that I think it is reasonable to expect that in practice
most organizations (big and small) tend to have at least some specific
organization-specific detail that warrants a custom serializer anyway;
and it's going to be easier to override a serializer than an entire
producer API.
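
A minimal sketch of the flow described above, where the producer only
accepts bytes and the application calls a shared serializer itself before
sending. BytesOnlyProducer, ValueSerializer, and publish() are hypothetical
names made up for this illustration; they are not Kafka classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bytes-only producer; it just records payloads.
class BytesOnlyProducer {
    final List<byte[]> sentValues = new ArrayList<>();

    void send(String topic, byte[] key, byte[] value) {
        sentValues.add(value);
    }
}

// Hypothetical shared serializer library that the application invokes itself.
interface ValueSerializer<T> {
    byte[] toBytes(T value);
}

class ExplicitSerializationSketch {
    // The application performs the serialization step explicitly,
    // then hands raw bytes to the producer.
    static <T> void publish(BytesOnlyProducer producer, ValueSerializer<T> serializer,
                            String topic, T value) {
        byte[] payload = serializer.toBytes(value);
        producer.send(topic, null, payload);
    }
}
```

In this arrangement the serializer is ordinary application code, so swapping
it means changing the argument passed to publish() rather than a producer
config.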

Joel

On Tue, Dec 02, 2014 at 11:09:55AM -0800, Jay Kreps wrote:
> Hey Joel, you are right, we discussed this, but I think we didn't think
> about it as deeply as we should have. I think our take was strongly shaped
> by having a wrapper api at LinkedIn that DOES do the serialization
> transparently so I think you are thinking of the producer as just an
> implementation detail of that wrapper. Imagine a world where every
> application at LinkedIn had to figure that part out themselves. That is,
> imagine that what you guys supported was just the raw producer api and that
> that just handled bytes. I think in that world the types of data you would
> see would be totally funky and standardizing correct usage would be a
> massive pain.
> 
> Conversely, you could imagine advocating the LinkedIn approach where you
> just say, well, every org should wrap up the clients in a way that does
> things like serialization and other data checks. The problem with that is
> that it (1) it is kind of redundant work and it is likely that the wrapper
> will goof some nuances of the apis, and (2) it makes documentation and code
> sharing really hard. That is, rather than being able to go to a central
> place and read how to use the producer, LinkedIn people need to document
> the LinkedIn producer wrapper, and users at LinkedIn need to read about
> LinkedIn's wrapper for the producer to understand how to use it. Now
> imagine this multiplied over every user.
> 
> The idea is that since everyone needs to do this we should just make it
> easy to package up the best practice and plug it in. That way the
> "contract" your application programs to is just the normal producer api.
> 
> -Jay
> 
> On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:
> 
> > Re: pushing complexity of dealing with objects: we're talking about
> > just a call to a serialize method to convert the object to a byte
> > array right? Or is there more to it? (To me) that seems less
> > cumbersome than having to interact with parameterized types. Actually,
> > can you explain more clearly what you mean by reason about what
> > type of data is being sent in your original email? I have some
> > notion of what that means but it is a bit vague and you might have
> > meant something else.
> >
> > Thanks,
> >
> > Joel
> >
> > On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > > Joel,
> > >
> > > Thanks for the feedback.
> > >
> > > Yes, the raw bytes interface is simpler than the Generic api. However, it
> > > just pushes the complexity of dealing with the objects to the
> > application.
> > > We also thought about the layered approach. However, this may confuse the
> > > users since there is no single entry point and it's not clear which
> > layer a
> > > user should be using.
> > >
> > > Jun
> > >
> > >
> > > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  wrote:
> > >
> > > > > makes it hard to reason about what type of data is being sent to
> > Kafka
> > > > and
> > > > > also makes it hard to share an implementation of the serializer. For
> > > > > example, to support Avro, the serialization logic could be quite
> > involved
> > > > > since it might need to register the Avro schema in some remote
> > registry
> > > > and
> > > > > maintain a schema cache locally, etc. Without a serialization api,
> > it's
> > > > > impossible to share such an implementation so that people can easily
> > > > reuse.
> > > > > We sort of overlooked this implication during the initial discussion
> > of
> > > > the
> > > > > producer api.
> > > >
> > > > Thanks for bringing this up and the patch.  My take on this is that
> > > > any reasoning about the data itself is more appropriately handled
> > > > outside of the core producer API. FWIW, I don't think this was
> > > > _overlooked_ during the initial discussion of the producer API
> > > > (especially since it was a significant change from the old producer).
> > > > IIRC we believed at the time that there is elegance and flexibility in
> > > > a simple API that deals with raw

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Jun Rao
Joel, Rajiv, Thunder,

The issue with a separate ser/deser library is that if it's not part of the
client API, (1) users may not use it or (2) different users may use it in
different ways. For example, you can imagine that two Avro implementations
have different ways of instantiation (since it's not enforced by the client
API). This makes sharing such libraries harder.
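
To make the instantiation point concrete: if every serializer has a
parameter-less constructor and a configure() method, a client can create any
implementation from a class name in the same way. The sketch below assumes
that contract. PluggableSerializer, CharsetStringSerializer, SerializerLoader,
and the "charset" config key are hypothetical names used only for
illustration.

```java
import java.nio.charset.Charset;
import java.util.Map;

// Hypothetical contract: no-arg constructor plus configure().
interface PluggableSerializer<T> {
    void configure(Map<String, ?> configs);

    byte[] serialize(String topic, T data, boolean isKey);
}

// Hypothetical implementation whose only setting arrives via configure().
class CharsetStringSerializer implements PluggableSerializer<String> {
    private Charset charset = Charset.forName("UTF-8");

    @Override
    public void configure(Map<String, ?> configs) {
        Object name = configs.get("charset"); // hypothetical config key
        if (name != null) {
            charset = Charset.forName(name.toString());
        }
    }

    @Override
    public byte[] serialize(String topic, String data, boolean isKey) {
        return data.getBytes(charset);
    }
}

class SerializerLoader {
    // One instantiation path for every implementation: look up the class,
    // call its parameter-less constructor, then pass it the configs.
    @SuppressWarnings("unchecked")
    static <T> PluggableSerializer<T> load(String className, Map<String, ?> configs)
            throws ReflectiveOperationException {
        Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
        PluggableSerializer<T> serializer = (PluggableSerializer<T>) instance;
        serializer.configure(configs);
        return serializer;
    }
}
```

Because the loader never needs implementation-specific constructor
arguments, two different Avro serializers would be set up through the same
code path.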

Joel,

As for reasoning about the data types, take the example of a consumer
application. It needs to deal with objects at some point. So the earlier
that type information is revealed, the clearer it is to the application.
Since the consumer client is the entry point where an application gets the
data, if the type is enforced there, it is clear to all downstream
consumers.

Thanks,

Jun

On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:

> Re: pushing complexity of dealing with objects: we're talking about
> just a call to a serialize method to convert the object to a byte
> array right? Or is there more to it? (To me) that seems less
> cumbersome than having to interact with parameterized types. Actually,
> can you explain more clearly what you mean by reason about what
> type of data is being sent in your original email? I have some
> notion of what that means but it is a bit vague and you might have
> meant something else.
>
> Thanks,
>
> Joel
>
> On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > Joel,
> >
> > Thanks for the feedback.
> >
> > Yes, the raw bytes interface is simpler than the Generic api. However, it
> > just pushes the complexity of dealing with the objects to the
> application.
> > We also thought about the layered approach. However, this may confuse the
> > users since there is no single entry point and it's not clear which
> layer a
> > user should be using.
> >
> > Jun
> >
> >
> > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  wrote:
> >
> > > > makes it hard to reason about what type of data is being sent to
> Kafka
> > > and
> > > > also makes it hard to share an implementation of the serializer. For
> > > > example, to support Avro, the serialization logic could be quite
> involved
> > > > since it might need to register the Avro schema in some remote
> registry
> > > and
> > > > maintain a schema cache locally, etc. Without a serialization api,
> it's
> > > > impossible to share such an implementation so that people can easily
> > > reuse.
> > > > We sort of overlooked this implication during the initial discussion
> of
> > > the
> > > > producer api.
> > >
> > > Thanks for bringing this up and the patch.  My take on this is that
> > > any reasoning about the data itself is more appropriately handled
> > > outside of the core producer API. FWIW, I don't think this was
> > > _overlooked_ during the initial discussion of the producer API
> > > (especially since it was a significant change from the old producer).
> > > IIRC we believed at the time that there is elegance and flexibility in
> > > a simple API that deals with raw bytes. I think it is more accurate to
> > > say that this is a reversal of opinion for some (which is fine) but
> > > personally I'm still in the old camp :) i.e., I really like the
> > > simplicity of the current 0.8.2 producer API and find parameterized
> > > types/generics to be distracting and annoying; and IMO any
> > > data-specific handling is better absorbed at a higher-level than the
> > > core Kafka APIs - possibly by a (very thin) wrapper producer library.
> > > I don't quite see why it is difficult to share different wrapper
> > > implementations; or even ser-de libraries for that matter that people
> > > can invoke before sending to/reading from Kafka.
> > >
> > > That said I'm not opposed to the change - it's just that I prefer
> > > what's currently there. So I'm +0 on the proposal.
> > >
> > > Thanks,
> > >
> > > Joel
> > >
> > > On Mon, Nov 24, 2014 at 05:58:50PM -0800, Jun Rao wrote:
> > > > Hi, Everyone,
> > > >
> > > > I'd like to start a discussion on whether it makes sense to add the
> > > > serializer api back to the new java producer. Currently, the new java
> > > > producer takes a byte array for both the key and the value. While
> this
> > > api
> > > > is simple, it pushes the serialization logic into the application.
> This
> > > > makes it hard to reason about what type of data is being sent to
> Kafka
> > > and
> > > > also makes it hard to share an implementation of the serializer. For
> > > > example, to support Avro, the serialization logic could be quite
> involved
> > > > since it might need to register the Avro schema in some remote
> registry
> > > and
> > > > maintain a schema cache locally, etc. Without a serialization api,
> it's
> > > > impossible to share such an implementation so that people can easily
> > > reuse.
> > > > We sort of overlooked this implication during the initial discussion
> of
> > > the
> > > > producer api.
> > > >
> > > > So, I'd like to propose an api change to the new producer by adding
> back
> > > > the

RE: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Thunder Stumpges
I'm not sure I agree with this. I feel that the need to have a consistent, well 
documented, shared serialization approach at the organization level is 
important no matter what. How you structure the API doesn't change that or make 
it any easier or "automatic" than before. It is still possible for users on 
different projects to "plug in" the wrong serializer or to "be totally funky". 
In order to make this consistent and completely encapsulated from users, a 
company would *still* need to write a shim layer that configures the correct 
serializer in a consistent way, and *that* still needs to be documented and 
understood.

Regards,
Thunder

-----Original Message-----
From: Jay Kreps [mailto:j...@confluent.io] 
Sent: Tuesday, December 02, 2014 11:10 AM
To: dev@kafka.apache.org
Cc: us...@kafka.apache.org
Subject: Re: [DISCUSSION] adding the serializer api back to the new java 
producer

Hey Joel, you are right, we discussed this, but I think we didn't think about 
it as deeply as we should have. I think our take was strongly shaped by having 
a wrapper api at LinkedIn that DOES do the serialization transparently so I 
think you are thinking of the producer as just an implementation detail of that 
wrapper. Imagine a world where every application at LinkedIn had to figure that 
part out themselves. That is, imagine that what you guys supported was just the 
raw producer api and that that just handled bytes. I think in that world the 
types of data you would see would be totally funky and standardizing correct 
usage would be a massive pain.

Conversely, you could imagine advocating the LinkedIn approach where you just 
say, well, every org should wrap up the clients in a way that does things like 
serialization and other data checks. The problem with that is that it (1) it is 
kind of redundant work and it is likely that the wrapper will goof some nuances 
of the apis, and (2) it makes documentation and code sharing really hard. That 
is, rather than being able to go to a central place and read how to use the 
producer, LinkedIn people need to document the LinkedIn producer wrapper, and 
users at LinkedIn need to read about LinkedIn's wrapper for the producer to 
understand how to use it. Now imagine this multiplied over every user.

The idea is that since everyone needs to do this we should just make it easy to 
package up the best practice and plug it in. That way the "contract" your 
application programs to is just the normal producer api.

-Jay

On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:

> Re: pushing complexity of dealing with objects: we're talking about 
> just a call to a serialize method to convert the object to a byte 
> array right? Or is there more to it? (To me) that seems less 
> cumbersome than having to interact with parameterized types. Actually, 
> can you explain more clearly what you mean by reason about what 
> type of data is being sent in your original email? I have some 
> notion of what that means but it is a bit vague and you might have 
> meant something else.
>
> Thanks,
>
> Joel
>
> On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > Joel,
> >
> > Thanks for the feedback.
> >
> > Yes, the raw bytes interface is simpler than the Generic api. 
> > However, it just pushes the complexity of dealing with the objects 
> > to the
> application.
> > We also thought about the layered approach. However, this may 
> > confuse the users since there is no single entry point and it's not 
> > clear which
> layer a
> > user should be using.
> >
> > Jun
> >
> >
> > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  wrote:
> >
> > > > makes it hard to reason about what type of data is being sent to
> Kafka
> > > and
> > > > also makes it hard to share an implementation of the serializer. 
> > > > For example, to support Avro, the serialization logic could be 
> > > > quite
> involved
> > > > since it might need to register the Avro schema in some remote
> registry
> > > and
> > > > maintain a schema cache locally, etc. Without a serialization 
> > > > api,
> it's
> > > > impossible to share such an implementation so that people can 
> > > > easily
> > > reuse.
> > > > We sort of overlooked this implication during the initial 
> > > > discussion
> of
> > > the
> > > > producer api.
> > >
> > > Thanks for bringing this up and the patch.  My take on this is 
> > > that any reasoning about the data itself is more appropriately 
> > > handled outside of the core producer API. FWIW, I don't think this 
> > > was _overlook

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Jay Kreps
Yeah totally, far from preventing it, making it easy to specify/encourage a
custom serializer across your org is exactly the kind of thing I was hoping
to make work well. If there is a config that gives the serializer you can
just default this to what you want people to use as some kind of
environment default or just tell people to set the property. A person who
wants to ignore this can, of course, but the easy thing to do will be to
use an off-the-shelf serialization method.

If you really want to enforce it, having an interface for serialization
would also let us optionally check this on the server side (e.g. if you
specify a serializer on the server side we validate that messages are in
this format).

If the api is just bytes of course you can make a serializer you want
people to use, and you can send around an email asking people to use it,
but the easy thing to do will remain "my string".getBytes() or whatever and
lots of people will do that instead.

Here the advantage of config is that (assuming your config system allows
it) you should be able to have some kind of global environment default for
these settings and easily grep across applications to determine what is in
use.
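
As a rough sketch of the environment-default idea, using only
java.util.Properties: organization-wide defaults sit underneath whatever an
application sets. The property names "key.serializer" and "value.serializer"
are assumptions here (the thread only gives the constant names), and the
com.example class names are placeholders.

```java
import java.util.Properties;

class SerializerConfigSketch {

    // Hypothetical organization-wide defaults.
    static Properties orgDefaults() {
        Properties defaults = new Properties();
        defaults.setProperty("key.serializer", "com.example.kafka.OrgAvroSerializer");
        defaults.setProperty("value.serializer", "com.example.kafka.OrgAvroSerializer");
        return defaults;
    }

    // Application settings are layered on top; anything the application
    // does not set falls through to the organization default.
    static Properties effectiveConfig(Properties appOverrides) {
        Properties effective = new Properties(orgDefaults());
        effective.putAll(appOverrides);
        return effective;
    }
}
```

Since the serializer choice is a plain config value, it can also be found by
searching the config files of each application.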

I think in all of this there is no hard and fast technical difference
between these approaches, i.e. there is nothing you can do one way that is
impossible the other way.

But I do think that having a nice way to plug in serialization makes it
much more straightforward and intuitive to package these things up inside
an organization. It also makes it possible to do validation on the server
side or make other tools that inspect or display messages (e.g. the various
command line tools) and do this in an easily pluggable way across tools.

The concern I was expressing was that in the absence of support for
serialization, what everyone will do is just make a wrapper api that
handles these things (since no one can actually use the producer without
serialization, and you will want to encourage use of the proper thing). The
problem I have with wrapper apis is that they defeat common documentation
and tend to be made without as much thought as the primary api.

The advantage of having serialization handled internally is that all you
need to do is know the right config for your organization and any example
usage remains the same.

Hopefully that helps explain the rationale a little more.

-Jay

On Tue, Dec 2, 2014 at 11:53 AM, Joel Koshy  wrote:

> Thanks for the follow-up Jay.  I still don't quite see the issue here
> but maybe I just need to process this a bit more. To me "packaging up
> the best practice and plug it in" seems to be to expose a simple
> low-level API and give people the option to plug in a (possibly
> shared) standard serializer in their application configs (or a custom
> one if they choose) and invoke that from code. The additional
> serialization call is a minor drawback but a very clear and easily
> understood step that can be documented.  The serializer can obviously
> also do other things such as schema registration. I'm actually not (or
> at least I think I'm not) influenced very much by LinkedIn's wrapper.
> It's just that I think it is reasonable to expect that in practice
> most organizations (big and small) tend to have at least some specific
> organization-specific detail that warrants a custom serializer anyway;
> and it's going to be easier to override a serializer than an entire
> producer API.
>
> Joel
>
> On Tue, Dec 02, 2014 at 11:09:55AM -0800, Jay Kreps wrote:
> > Hey Joel, you are right, we discussed this, but I think we didn't think
> > about it as deeply as we should have. I think our take was strongly
> shaped
> > by having a wrapper api at LinkedIn that DOES do the serialization
> > transparently so I think you are thinking of the producer as just an
> > implementation detail of that wrapper. Imagine a world where every
> > application at LinkedIn had to figure that part out themselves. That is,
> > imagine that what you guys supported was just the raw producer api and
> that
> > that just handled bytes. I think in that world the types of data you
> would
> > see would be totally funky and standardizing correct usage would be a
> > massive pain.
> >
> > Conversely, you could imagine advocating the LinkedIn approach where you
> > just say, well, every org should wrap up the clients in a way that does
> > things like serialization and other data checks. The problem with that is
> > that it (1) it is kind of redundant work and it is likely that the
> wrapper
> > will goof some nuances of the apis, and (2) it makes documentation and
> code
> > sharing really hard. That is, rather than being able to go to a central
> > place and read how to use the producer, LinkedIn people need to document
> > the LinkedIn producer wrapper, and users at LinkedIn need to read about
> > LinkedIn's wrapper for the producer to understand how to use it. Now
> > imagine this multiplied over every user.
> >
> > The idea is tha

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Rajiv Kurian
Why can't the organization package the Avro implementation with a kafka
client and distribute that library though? The risk of different users
supplying the kafka client with different serializer/deserializer
implementations still exists.

On Tue, Dec 2, 2014 at 12:11 PM, Jun Rao  wrote:

> Joel, Rajiv, Thunder,
>
> The issue with a separate ser/deser library is that if it's not part of the
> client API, (1) users may not use it or (2) different users may use it in
> different ways. For example, you can imagine that two Avro implementations
> have different ways of instantiation (since it's not enforced by the client
> API). This makes sharing such kind of libraries harder.
>
> Joel,
>
> As for reason about the data types, take an example of the consumer
> application. It needs to deal with objects at some point. So the earlier
> that type information is revealed, the clearer it is to the application.
> Since the consumer client is the entry point where an application gets the
> data,  if the type is enforced there, it makes it clear to all down stream
> consumers.
>
> Thanks,
>
> Jun
>
> On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:
>
> > Re: pushing complexity of dealing with objects: we're talking about
> > just a call to a serialize method to convert the object to a byte
> > array right? Or is there more to it? (To me) that seems less
> > cumbersome than having to interact with parameterized types. Actually,
> > can you explain more clearly what you mean by reason about what
> > type of data is being sent in your original email? I have some
> > notion of what that means but it is a bit vague and you might have
> > meant something else.
> >
> > Thanks,
> >
> > Joel
> >
> > On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > > Joel,
> > >
> > > Thanks for the feedback.
> > >
> > > Yes, the raw bytes interface is simpler than the Generic api. However,
> it
> > > just pushes the complexity of dealing with the objects to the
> > application.
> > > We also thought about the layered approach. However, this may confuse
> the
> > > users since there is no single entry point and it's not clear which
> > layer a
> > > user should be using.
> > >
> > > Jun
> > >
> > >
> > > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy 
> wrote:
> > >
> > > > > makes it hard to reason about what type of data is being sent to
> > Kafka
> > > > and
> > > > > also makes it hard to share an implementation of the serializer.
> For
> > > > > example, to support Avro, the serialization logic could be quite
> > involved
> > > > > since it might need to register the Avro schema in some remote
> > registry
> > > > and
> > > > > maintain a schema cache locally, etc. Without a serialization api,
> > it's
> > > > > impossible to share such an implementation so that people can
> easily
> > > > reuse.
> > > > > We sort of overlooked this implication during the initial
> discussion
> > of
> > > > the
> > > > > producer api.
> > > >
> > > > Thanks for bringing this up and the patch.  My take on this is that
> > > > any reasoning about the data itself is more appropriately handled
> > > > outside of the core producer API. FWIW, I don't think this was
> > > > _overlooked_ during the initial discussion of the producer API
> > > > (especially since it was a significant change from the old producer).
> > > > IIRC we believed at the time that there is elegance and flexibility
> in
> > > > a simple API that deals with raw bytes. I think it is more accurate
> to
> > > > say that this is a reversal of opinion for some (which is fine) but
> > > > personally I'm still in the old camp :) i.e., I really like the
> > > > simplicity of the current 0.8.2 producer API and find parameterized
> > > > types/generics to be distracting and annoying; and IMO any
> > > > data-specific handling is better absorbed at a higher-level than the
> > > > core Kafka APIs - possibly by a (very thin) wrapper producer library.
> > > > I don't quite see why it is difficult to share different wrapper
> > > > implementations; or even ser-de libraries for that matter that people
> > > > can invoke before sending to/reading from Kafka.
> > > >
> > > > That said I'm not opposed to the change - it's just that I prefer
> > > > what's currently there. So I'm +0 on the proposal.
> > > >
> > > > Thanks,
> > > >
> > > > Joel
> > > >
> > > > On Mon, Nov 24, 2014 at 05:58:50PM -0800, Jun Rao wrote:
> > > > > Hi, Everyone,
> > > > >
> > > > > I'd like to start a discussion on whether it makes sense to add the
> > > > > serializer api back to the new java producer. Currently, the new
> java
> > > > > producer takes a byte array for both the key and the value. While
> > this
> > > > api
> > > > > is simple, it pushes the serialization logic into the application.
> > This
> > > > > makes it hard to reason about what type of data is being sent to
> > Kafka
> > > > and
> > > > > also makes it hard to share an implementation of the serializer.
> For
> > > > > example, to support Avro, the s

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Joel Koshy
> The issue with a separate ser/deser library is that if it's not part of the
> client API, (1) users may not use it or (2) different users may use it in
> different ways. For example, you can imagine that two Avro implementations
> have different ways of instantiation (since it's not enforced by the client
> API). This makes sharing such kind of libraries harder.

That is true - but that is also the point I think and it seems
irrelevant to whether it is built-in to the producer's config or
plugged in outside at the application-level. i.e., users will not use
a common implementation if it does not fit their requirements. If a
well-designed, full-featured and correctly implemented avro-or-other
serializer/deserializer is made available there is no reason why that
cannot be shared by different applications.

> As for reason about the data types, take an example of the consumer
> application. It needs to deal with objects at some point. So the earlier
> that type information is revealed, the clearer it is to the application.

Again for this, the only additional step is a call to deserialize. At
some level the application _has_ to deal with the specific data type
and it is thus reasonable to require that a consumed byte array needs
to be deserialized to that type before being used.
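
To make the "one extra call" argument concrete, here is a minimal sketch of
an application-level ser-de that is invoked explicitly around a byte[]-only
client. PageView and PageViewSerde are hypothetical names invented for this
illustration; they stand in for a shared Avro-or-other library and are not
part of Kafka or of the proposed patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical application type, used only for illustration.
final class PageView {
    final String url;
    final int hits;

    PageView(String url, int hits) {
        this.url = url;
        this.hits = hits;
    }
}

// Hypothetical shared ser-de helper that applications call themselves.
final class PageViewSerde {
    // Layout (an assumption of this sketch): 4-byte hit count, then the URL as UTF-8.
    static byte[] serialize(PageView view) {
        byte[] url = view.url.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + url.length).putInt(view.hits).put(url).array();
    }

    // The single extra step a consumer takes after receiving raw bytes.
    static PageView deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int hits = buf.getInt();
        byte[] url = new byte[buf.remaining()];
        buf.get(url);
        return new PageView(new String(url, StandardCharsets.UTF_8), hits);
    }
}
```

With this approach the client only ever sees byte[]; the producing side
calls PageViewSerde.serialize(...) before sending and the consuming side
calls PageViewSerde.deserialize(...) on whatever it receives.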

I suppose I don't see much benefit in pushing this into the core API
of the producer at the expense of making these changes to the API.  At
the same time, I should be clear that I don't think the proposal is in
any way unreasonable which is why I'm definitely not opposed to it,
but I'm also not convinced that it is necessary.

Thanks,

Joel

> 
> On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:
> 
> > Re: pushing complexity of dealing with objects: we're talking about
> > just a call to a serialize method to convert the object to a byte
> > array right? Or is there more to it? (To me) that seems less
> > cumbersome than having to interact with parameterized types. Actually,
> > can you explain more clearly what you mean by reason about what
> > type of data is being sent in your original email? I have some
> > notion of what that means but it is a bit vague and you might have
> > meant something else.
> >
> > Thanks,
> >
> > Joel
> >
> > On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > > Joel,
> > >
> > > Thanks for the feedback.
> > >
> > > Yes, the raw bytes interface is simpler than the Generic api. However, it
> > > just pushes the complexity of dealing with the objects to the
> > application.
> > > We also thought about the layered approach. However, this may confuse the
> > > users since there is no single entry point and it's not clear which
> > layer a
> > > user should be using.
> > >
> > > Jun
> > >
> > >
> > > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  wrote:
> > >
> > > > > makes it hard to reason about what type of data is being sent to
> > Kafka
> > > > and
> > > > > also makes it hard to share an implementation of the serializer. For
> > > > > example, to support Avro, the serialization logic could be quite
> > involved
> > > > > since it might need to register the Avro schema in some remote
> > registry
> > > > and
> > > > > maintain a schema cache locally, etc. Without a serialization api,
> > it's
> > > > > impossible to share such an implementation so that people can easily
> > > > reuse.
> > > > > We sort of overlooked this implication during the initial discussion
> > of
> > > > the
> > > > > producer api.
> > > >
> > > > Thanks for bringing this up and the patch.  My take on this is that
> > > > any reasoning about the data itself is more appropriately handled
> > > > outside of the core producer API. FWIW, I don't think this was
> > > > _overlooked_ during the initial discussion of the producer API
> > > > (especially since it was a significant change from the old producer).
> > > > IIRC we believed at the time that there is elegance and flexibility in
> > > > a simple API that deals with raw bytes. I think it is more accurate to
> > > > say that this is a reversal of opinion for some (which is fine) but
> > > > personally I'm still in the old camp :) i.e., I really like the
> > > > simplicity of the current 0.8.2 producer API and find parameterized
> > > > types/generics to be distracting and annoying; and IMO any
> > > > data-specific handling is better absorbed at a higher-level than the
> > > > core Kafka APIs - possibly by a (very thin) wrapper producer library.
> > > > I don't quite see why it is difficult to share different wrapper
> > > > implementations; or even ser-de libraries for that matter that people
> > > > can invoke before sending to/reading from Kafka.
> > > >
> > > > That said I'm not opposed to the change - it's just that I prefer
> > > > what's currently there. So I'm +0 on the proposal.
> > > >
> > > > Thanks,
> > > >
> > > > Joel
> > > >
> > > > On Mon, Nov 24, 2014 at 05:58:50PM -0800, Jun Rao wrote:
> > > > > Hi, Everyone,
> > > > >
> > > > > I'd like to start a disc

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Roger Hoover
"It also makes it possible to do validation on the server
side or make other tools that inspect or display messages (e.g. the various
command line tools) and do this in an easily pluggable way across tools."

I agree that it's valuable to have a standard way to plug in serialization
across many tools, especially for producers.  For example, the Kafka
producer might get wrapped by JRuby and exposed as a Logstash plugin.
With a standard method for plugging in serdes, one can reuse a serde with
any tool that wraps the standard producer API.  This won't be possible if
we rely on custom wrappers.
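
As a rough sketch of what "plugging in a serde by config" could look like,
the following loads a serializer class by name from a Properties object.
The Serializer interface, the Utf8StringSerializer class, the
SerializerLoader helper and the config key used below are all assumptions
made for this illustration; they are not the interface from the patch under
discussion.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical interface for illustration; not the one from the patch.
interface Serializer<T> {
    byte[] serialize(String topic, T data);
}

// Trivial implementation that encodes strings as UTF-8.
class Utf8StringSerializer implements Serializer<String> {
    @Override
    public byte[] serialize(String topic, String data) {
        return data.getBytes(StandardCharsets.UTF_8);
    }
}

final class SerializerLoader {
    // Instantiates whichever class the config names, so switching
    // serializers is a config change rather than a code change.
    @SuppressWarnings("unchecked")
    static <T> Serializer<T> fromConfig(Properties props, String key) {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("missing config: " + key);
        }
        try {
            Class<?> cls = Class.forName(className);
            return (Serializer<T>) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load serializer " + className, e);
        }
    }
}
```

Any tool that wraps a producer built this way (a JRuby wrapper, a command
line tool) could pick up the same serde by setting one property, which is
the reuse described above. The unchecked cast is the price of loading by
name: the config cannot prove that the class matches the expected type.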

On Tue, Dec 2, 2014 at 1:49 PM, Jay Kreps  wrote:

> Yeah totally, far from preventing it, making it easy to specify/encourage a
> custom serializer across your org is exactly the kind of thing I was hoping
> to make work well. If there is a config that gives the serializer you can
> just default this to what you want people to use as some kind of
> environment default or just tell people to set the property. A person who
> wants to ignore this can, of course, but the easy thing to do will be to
> use an off-the-shelf serialization method.
>
> If you really want to enforce it, having an interface for serialization
> would also let us optionally check this on the server side (e.g. if you
> specify a serializer on the server side we validate that messages are in
> this format).
>
> If the api is just bytes of course you can make a serializer you want
> people to use, and you can send around an email asking people to use it,
> but the easy thing to do will remain "my string".getBytes() or whatever and
> lots of people will do that instead.
>
> Here the advantage of config is that (assuming your config system allows
> it) you should be able to have some kind of global environment default for
> these settings and easily grep across applications to determine what is in
> use.
>
> I think in all of this there is no hard and fast technical difference
> between these approaches, i.e. there is nothing you can do one way that is
> impossible the other way.
>
> But I do think that having a nice way to plug in serialization makes it
> much more straight-forward and intuitive to package these things up inside
> an organization. It also makes it possible to do validation on the server
> side or make other tools that inspect or display messages (e.g. the various
> command line tools) and do this in an easily pluggable way across tools.
>
> The concern I was expressing was that in the absence of support for
> serialization, what everyone will do is just make a wrapper api that
> handles these things (since no one can actually use the producer without
> serialization, and you will want to encourage use of the proper thing). The
> problem I have with wrapper apis is that they defeat common documentation
> and tend to be made without as much thought as the primary api.
>
> The advantage of having serialization handled internally is that all you
> need to do is know the right config for your organization and any example
> usage remains the same.
>
> Hopefully that helps explain the rationale a little more.
>
> -Jay
>
> On Tue, Dec 2, 2014 at 11:53 AM, Joel Koshy  wrote:
>
> > Thanks for the follow-up Jay.  I still don't quite see the issue here
> > but maybe I just need to process this a bit more. To me "packaging up
> > the best practice and plug it in" seems to be to expose a simple
> > low-level API and give people the option to plug in a (possibly
> > shared) standard serializer in their application configs (or a custom
> > one if they choose) and invoke that from code. The additional
> > serialization call is a minor drawback but a very clear and easily
> > understood step that can be documented.  The serializer can obviously
> > also do other things such as schema registration. I'm actually not (or
> > at least I think I'm not) influenced very much by LinkedIn's wrapper.
> > It's just that I think it is reasonable to expect that in practice
> > most organizations (big and small) tend to have at least some specific
> > organization-specific detail that warrants a custom serializer anyway;
> > and it's going to be easier to override a serializer than an entire
> > producer API.
> >
> > Joel
> >
> > On Tue, Dec 02, 2014 at 11:09:55AM -0800, Jay Kreps wrote:
> > > Hey Joel, you are right, we discussed this, but I think we didn't think
> > > about it as deeply as we should have. I think our take was strongly
> > shaped
> > > by having a wrapper api at LinkedIn that DOES do the serialization
> > > transparently so I think you are thinking of the producer as just an
> > > implementation detail of that wrapper. Imagine a world where every
> > > application at LinkedIn had to figure that part out themselves. That
> is,
> > > imagine that what you guys supported was just the raw producer api and
> > that
> > > that just handled bytes. I think in that world the types of data you
> > would
> > > see wo

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Jun Rao
For (1), yes, but it's easier to make a config change than a code change.
If you are using a third party library, you may not be able to make any
code change.

For (2), it's just that if most consumers always do deserialization after
getting the raw bytes, perhaps it would be better to have these two steps
integrated.

Thanks,

Jun

On Tue, Dec 2, 2014 at 2:05 PM, Joel Koshy  wrote:

> > The issue with a separate ser/deser library is that if it's not part of
> the
> > client API, (1) users may not use it or (2) different users may use it in
> > different ways. For example, you can imagine that two Avro
> implementations
> > have different ways of instantiation (since it's not enforced by the
> client
> > API). This makes sharing such kind of libraries harder.
>
> That is true - but that is also the point I think and it seems
> irrelevant to whether it is built-in to the producer's config or
> plugged in outside at the application-level. i.e., users will not use
> a common implementation if it does not fit their requirements. If a
> well-designed, full-featured and correctly implemented avro-or-other
> serializer/deserializer is made available there is no reason why that
> cannot be shared by different applications.
>
> > As for reason about the data types, take an example of the consumer
> > application. It needs to deal with objects at some point. So the earlier
> > that type information is revealed, the clearer it is to the application.
>
> Again for this, the only additional step is a call to deserialize. At
> some level the application _has_ to deal with the specific data type
> and it is thus reasonable to require that a consumed byte array needs
> to be deserialized to that type before being used.
>
> I suppose I don't see much benefit in pushing this into the core API
> of the producer at the expense of making these changes to the API.  At
> the same time, I should be clear that I don't think the proposal is in
> any way unreasonable which is why I'm definitely not opposed to it,
> but I'm also not convinced that it is necessary.
>
> Thanks,
>
> Joel
>
> >
> > On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:
> >
> > > Re: pushing complexity of dealing with objects: we're talking about
> > > just a call to a serialize method to convert the object to a byte
> > > array right? Or is there more to it? (To me) that seems less
> > > cumbersome than having to interact with parameterized types. Actually,
> > > can you explain more clearly what you mean by reason about what
> > > type of data is being sent in your original email? I have some
> > > notion of what that means but it is a bit vague and you might have
> > > meant something else.
> > >
> > > Thanks,
> > >
> > > Joel
> > >
> > > On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > > > Joel,
> > > >
> > > > Thanks for the feedback.
> > > >
> > > > Yes, the raw bytes interface is simpler than the Generic api.
> However, it
> > > > just pushes the complexity of dealing with the objects to the
> > > application.
> > > > We also thought about the layered approach. However, this may
> confuse the
> > > > users since there is no single entry point and it's not clear which
> > > layer a
> > > > user should be using.
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy 
> wrote:
> > > >
> > > > > > makes it hard to reason about what type of data is being sent to
> > > Kafka
> > > > > and
> > > > > > also makes it hard to share an implementation of the serializer.
> For
> > > > > > example, to support Avro, the serialization logic could be quite
> > > involved
> > > > > > since it might need to register the Avro schema in some remote
> > > registry
> > > > > and
> > > > > > maintain a schema cache locally, etc. Without a serialization
> api,
> > > it's
> > > > > > impossible to share such an implementation so that people can
> easily
> > > > > reuse.
> > > > > > We sort of overlooked this implication during the initial
> discussion
> > > of
> > > > > the
> > > > > > producer api.
> > > > >
> > > > > Thanks for bringing this up and the patch.  My take on this is that
> > > > > any reasoning about the data itself is more appropriately handled
> > > > > outside of the core producer API. FWIW, I don't think this was
> > > > > _overlooked_ during the initial discussion of the producer API
> > > > > (especially since it was a significant change from the old
> producer).
> > > > > IIRC we believed at the time that there is elegance and
> flexibility in
> > > > > a simple API that deals with raw bytes. I think it is more
> accurate to
> > > > > say that this is a reversal of opinion for some (which is fine) but
> > > > > personally I'm still in the old camp :) i.e., I really like the
> > > > > simplicity of the current 0.8.2 producer API and find parameterized
> > > > > types/generics to be distracting and annoying; and IMO any
> > > > > data-specific handling is better absorbed at a higher-level than
> the
> > > > > cor

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Joel Koshy
> For (1), yes, but it's easier to make a config change than a code change.
> If you are using a third party library, one may not be able to make any
> code change.

Doesn't that assume that all organizations already share the
same underlying specific data type definition (e.g.,
UniversalAvroRecord)? If not, then wouldn't they have to make a
code change anyway to use the shared definition (since that is
required in the parameterized type of the ProducerRecord and
producer)?  And if they have already made the change to use the said
shared definition then you could just as well have the serializer of
UniversalAvroRecord configured in your application config and have
that replaced if you wish by some other implementation of a serializer
of UniversalAvroRecord (again via config).

> For (2), it's just that if most consumers always do deserialization after
> getting the raw bytes, perhaps it would be better to have these two steps
> integrated.

True, but it is just a marginal and very obvious step that shouldn't
surprise any user.

Thanks,

Joel

> 
> Thanks,
> 
> Jun
> 
> On Tue, Dec 2, 2014 at 2:05 PM, Joel Koshy  wrote:
> 
> > > The issue with a separate ser/deser library is that if it's not part of
> > the
> > > client API, (1) users may not use it or (2) different users may use it in
> > > different ways. For example, you can imagine that two Avro
> > implementations
> > > have different ways of instantiation (since it's not enforced by the
> > client
> > > API). This makes sharing such kind of libraries harder.
> >
> > That is true - but that is also the point I think and it seems
> > irrelevant to whether it is built-in to the producer's config or
> > plugged in outside at the application-level. i.e., users will not use
> > a common implementation if it does not fit their requirements. If a
> > well-designed, full-featured and correctly implemented avro-or-other
> > serializer/deserializer is made available there is no reason why that
> > cannot be shared by different applications.
> >
> > > As for reason about the data types, take an example of the consumer
> > > application. It needs to deal with objects at some point. So the earlier
> > > that type information is revealed, the clearer it is to the application.
> >
> > Again for this, the only additional step is a call to deserialize. At
> > some level the application _has_ to deal with the specific data type
> > and it is thus reasonable to require that a consumed byte array needs
> > to be deserialized to that type before being used.
> >
> > I suppose I don't see much benefit in pushing this into the core API
> > of the producer at the expense of making these changes to the API.  At
> > the same time, I should be clear that I don't think the proposal is in
> > any way unreasonable which is why I'm definitely not opposed to it,
> > but I'm also not convinced that it is necessary.
> >
> > Thanks,
> >
> > Joel
> >
> > >
> > > On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  wrote:
> > >
> > > > Re: pushing complexity of dealing with objects: we're talking about
> > > > just a call to a serialize method to convert the object to a byte
> > > > array right? Or is there more to it? (To me) that seems less
> > > > cumbersome than having to interact with parameterized types. Actually,
> > > > can you explain more clearly what you mean by reason about what
> > > > type of data is being sent in your original email? I have some
> > > > notion of what that means but it is a bit vague and you might have
> > > > meant something else.
> > > >
> > > > Thanks,
> > > >
> > > > Joel
> > > >
> > > > On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > > > > Joel,
> > > > >
> > > > > Thanks for the feedback.
> > > > >
> > > > > Yes, the raw bytes interface is simpler than the Generic api.
> > However, it
> > > > > just pushes the complexity of dealing with the objects to the
> > > > application.
> > > > > We also thought about the layered approach. However, this may
> > confuse the
> > > > > users since there is no single entry point and it's not clear which
> > > > layer a
> > > > > user should be using.
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy 
> > wrote:
> > > > >
> > > > > > > makes it hard to reason about what type of data is being sent to
> > > > Kafka
> > > > > > and
> > > > > > > also makes it hard to share an implementation of the serializer.
> > For
> > > > > > > example, to support Avro, the serialization logic could be quite
> > > > involved
> > > > > > > since it might need to register the Avro schema in some remote
> > > > registry
> > > > > > and
> > > > > > > maintain a schema cache locally, etc. Without a serialization
> > api,
> > > > it's
> > > > > > > impossible to share such an implementation so that people can
> > easily
> > > > > > reuse.
> > > > > > > We sort of overlooked this implication during the initial
> > discussion
> > > > of
> > > > > > the
> > > > > > > producer 

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Rajiv Kurian
I for one use the consumer (Simple Consumer) without any deserialization. I
just take the ByteBuffer, wrap it in a preallocated flyweight, and use it
without creating any objects. I'd ideally not have to wrap this logic in a
deserializer interface. For everyone who does do this, it seems like a
very small step.
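
For readers unfamiliar with the flyweight approach, a minimal sketch
follows. The message layout (an 8-byte timestamp followed by a 4-byte
count) and the EventFlyweight name are assumptions made for this
illustration, not anything from an actual application.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed layout: 8-byte timestamp, then 4-byte count.
final class EventFlyweight {
    private static final int TIMESTAMP_OFFSET = 0;
    private static final int COUNT_OFFSET = 8;
    static final int LENGTH = 12;

    private ByteBuffer buffer;
    private int base;

    // Points this instance at a message inside the buffer.
    // Nothing is copied and nothing is allocated.
    EventFlyweight wrap(ByteBuffer buffer, int base) {
        this.buffer = buffer;
        this.base = base;
        return this;
    }

    // Fields are read straight from the underlying bytes on demand.
    long timestamp() {
        return buffer.getLong(base + TIMESTAMP_OFFSET);
    }

    int count() {
        return buffer.getInt(base + COUNT_OFFSET);
    }
}
```

One preallocated EventFlyweight can be re-pointed at each message in turn,
so reading a batch creates no per-message objects. A deserializer interface
that returns a new object for every message would not fit this pattern
directly.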

On Tue, Dec 2, 2014 at 5:12 PM, Joel Koshy  wrote:

> > For (1), yes, but it's easier to make a config change than a code change.
> > If you are using a third party library, one may not be able to make any
> > code change.
>
> Doesn't that assume that all organizations already share the
> same underlying specific data type definition (e.g.,
> UniversalAvroRecord)? If not, then wouldn't they have to make a
> code change anyway to use the shared definition (since that is
> required in the parameterized type of the ProducerRecord and
> producer)?  And if they have already made the change to use the said
> shared definition then you could just as well have the serializer of
> UniversalAvroRecord configured in your application config and have
> that replaced if you wish by some other implementation of a serializer
> of UniversalAvroRecord (again via config).
>
> > For (2), it's just that if most consumers always do deserialization after
> > getting the raw bytes, perhaps it would be better to have these two steps
> > integrated.
>
> True, but it is just a marginal and very obvious step that shouldn't
> surprise any user.
>
> Thanks,
>
> Joel
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Dec 2, 2014 at 2:05 PM, Joel Koshy  wrote:
> >
> > > > The issue with a separate ser/deser library is that if it's not part
> of
> > > the
> > > > client API, (1) users may not use it or (2) different users may use
> it in
> > > > different ways. For example, you can imagine that two Avro
> > > implementations
> > > > have different ways of instantiation (since it's not enforced by the
> > > client
> > > > API). This makes sharing such kind of libraries harder.
> > >
> > > That is true - but that is also the point I think and it seems
> > > irrelevant to whether it is built-in to the producer's config or
> > > plugged in outside at the application-level. i.e., users will not use
> > > a common implementation if it does not fit their requirements. If a
> > > well-designed, full-featured and correctly implemented avro-or-other
> > > serializer/deserializer is made available there is no reason why that
> > > cannot be shared by different applications.
> > >
> > > > As for reason about the data types, take an example of the consumer
> > > > application. It needs to deal with objects at some point. So the
> earlier
> > > > that type information is revealed, the clearer it is to the
> application.
> > >
> > > Again for this, the only additional step is a call to deserialize. At
> > > some level the application _has_ to deal with the specific data type
> > > and it is thus reasonable to require that a consumed byte array needs
> > > to be deserialized to that type before being used.
> > >
> > > I suppose I don't see much benefit in pushing this into the core API
> > > of the producer at the expense of making these changes to the API.  At
> > > the same time, I should be clear that I don't think the proposal is in
> > > any way unreasonable which is why I'm definitely not opposed to it,
> > > but I'm also not convinced that it is necessary.
> > >
> > > Thanks,
> > >
> > > Joel
> > >
> > > >
> > > > On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy 
> wrote:
> > > >
> > > > > Re: pushing complexity of dealing with objects: we're talking about
> > > > > just a call to a serialize method to convert the object to a byte
> > > > > array right? Or is there more to it? (To me) that seems less
> > > > > cumbersome than having to interact with parameterized types.
> Actually,
> > > > > can you explain more clearly what you mean by reason about what
> > > > > type of data is being sent in your original email? I have some
> > > > > notion of what that means but it is a bit vague and you might have
> > > > > meant something else.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Joel
> > > > >
> > > > > On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > > > > > Joel,
> > > > > >
> > > > > > Thanks for the feedback.
> > > > > >
> > > > > > Yes, the raw bytes interface is simpler than the Generic api.
> > > However, it
> > > > > > just pushes the complexity of dealing with the objects to the
> > > > > application.
> > > > > > We also thought about the layered approach. However, this may
> > > confuse the
> > > > > > users since there is no single entry point and it's not clear
> which
> > > > > layer a
> > > > > > user should be using.
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > >
> > > > > > On Tue, Dec 2, 2014 at 12:34 AM, Joel Koshy  >
> > > wrote:
> > > > > >
> > > > > > > > makes it hard to reason about what type of data is being
> sent to
> > > > > Kafka
> > > > > > > and
> > > > > > > > also makes it hard to share an

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Jun Rao
Rajiv,

That's probably a very special use case. Note that even in the new consumer
api w/o the generics, the client is only going to get the byte array back.
So, you won't be able to take advantage of reusing the ByteBuffer in the
underlying responses.

Thanks,

Jun

On Tue, Dec 2, 2014 at 5:26 PM, Rajiv Kurian  wrote:

> I for one use the consumer (Simple Consumer) without any deserialization. I
> just take the ByteBuffer, wrap it in a preallocated flyweight, and use it
> without creating any objects. I'd ideally not have to wrap this logic in a
> deserializer interface. For everyone who does do this, it seems like a
> very small step.
>
> On Tue, Dec 2, 2014 at 5:12 PM, Joel Koshy  wrote:
>
> > > For (1), yes, but it's easier to make a config change than a code
> change.
> > > If you are using a third party library, one may not be able to make any
> > > code change.
> >
> > Doesn't that assume that all organizations already share the
> > same underlying specific data type definition (e.g.,
> > UniversalAvroRecord)? If not, then wouldn't they have to make a
> > code change anyway to use the shared definition (since that is
> > required in the parameterized type of the ProducerRecord and
> > producer)?  And if they have already made the change to use the said
> > shared definition then you could just as well have the serializer of
> > UniversalAvroRecord configured in your application config and have
> > that replaced if you wish by some other implementation of a serializer
> > of UniversalAvroRecord (again via config).
> >
> > > For (2), it's just that if most consumers always do deserialization
> after
> > > getting the raw bytes, perhaps it would be better to have these two
> steps
> > > integrated.
> >
> > True, but it is just a marginal and very obvious step that shouldn't
> > surprise any user.
> >
> > Thanks,
> >
> > Joel
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Dec 2, 2014 at 2:05 PM, Joel Koshy 
> wrote:
> > >
> > > > > The issue with a separate ser/deser library is that if it's not
> part
> > of
> > > > the
> > > > > client API, (1) users may not use it or (2) different users may use
> > it in
> > > > > different ways. For example, you can imagine that two Avro
> > > > implementations
> > > > > have different ways of instantiation (since it's not enforced by
> the
> > > > client
> > > > > API). This makes sharing such kind of libraries harder.
> > > >
> > > > That is true - but that is also the point I think and it seems
> > > > irrelevant to whether it is built-in to the producer's config or
> > > > plugged in outside at the application-level. i.e., users will not use
> > > > a common implementation if it does not fit their requirements. If a
> > > > well-designed, full-featured and correctly implemented avro-or-other
> > > > serializer/deserializer is made available there is no reason why that
> > > > cannot be shared by different applications.
> > > >
> > > > > As for reason about the data types, take an example of the consumer
> > > > > application. It needs to deal with objects at some point. So the
> > earlier
> > > > > that type information is revealed, the clearer it is to the
> > application.
> > > >
> > > > Again for this, the only additional step is a call to deserialize. At
> > > > some level the application _has_ to deal with the specific data type
> > > > and it is thus reasonable to require that a consumed byte array needs
> > > > to be deserialized to that type before being used.
> > > >
> > > > I suppose I don't see much benefit in pushing this into the core API
> > > > of the producer at the expense of making these changes to the API.
> At
> > > > the same time, I should be clear that I don't think the proposal is
> in
> > > > any way unreasonable which is why I'm definitely not opposed to it,
> > > > but I'm also not convinced that it is necessary.
> > > >
> > > > Thanks,
> > > >
> > > > Joel
> > > >
> > > > >
> > > > > On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy 
> > wrote:
> > > > >
> > > > > > Re: pushing complexity of dealing with objects: we're talking
> about
> > > > > > just a call to a serialize method to convert the object to a byte
> > > > > > array right? Or is there more to it? (To me) that seems less
> > > > > > cumbersome than having to interact with parameterized types.
> > Actually,
> > > > > > can you explain more clearly what you mean by reason about
> what
> > > > > > type of data is being sent in your original email? I have
> some
> > > > > > notion of what that means but it is a bit vague and you might
> have
> > > > > > meant something else.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Joel
> > > > > >
> > > > > > On Tue, Dec 02, 2014 at 09:15:19AM -0800, Jun Rao wrote:
> > > > > > > Joel,
> > > > > > >
> > > > > > > Thanks for the feedback.
> > > > > > >
> > > > > > > Yes, the raw bytes interface is simpler than the Generic api.
> > > > However, it
> > > > > > > just pushes the complexity of dealing with the objec

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-02 Thread Rajiv Kurian
Yeah I am kind of sad about that :(. I just mentioned it to show that there
are material use cases for applications where you expose the underlying
ByteBuffer (I know we were talking about byte arrays) instead of
serializing/deserializing objects -  performance is a big one.
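
For readers unfamiliar with the flyweight approach described here, the following is a minimal sketch only. The class name and the message layout (an 8-byte id followed by a 4-byte count) are assumptions made for illustration and are not part of Kafka or of any code in this thread.

```java
import java.nio.ByteBuffer;

// Hypothetical flyweight: reads fields in place from a ByteBuffer instead of
// deserializing into a new object per message. The layout (long id at offset
// 0, int count at offset 8) is an assumption for this sketch.
final class OrderFlyweight {
    private ByteBuffer buffer;
    private int offset;

    // Re-point the preallocated flyweight at a message; nothing is copied.
    OrderFlyweight wrap(ByteBuffer buffer, int offset) {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    long id() {
        return buffer.getLong(offset);
    }

    int count() {
        return buffer.getInt(offset + 8);
    }
}

class FlyweightDemo {
    public static void main(String[] args) {
        ByteBuffer payload = ByteBuffer.allocate(12);
        payload.putLong(0, 42L).putInt(8, 7);

        OrderFlyweight flyweight = new OrderFlyweight(); // allocated once
        flyweight.wrap(payload, 0);                      // reused per message
        System.out.println(flyweight.id() + " " + flyweight.count());
    }
}
```

The same flyweight instance can be re-wrapped around each incoming buffer, which is why this style avoids per-message allocation.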


On Tue, Dec 2, 2014 at 5:42 PM, Jun Rao  wrote:

> Rajiv,
>
> That's probably a very special use case. Note that even in the new consumer
> api w/o the generics, the client is only going to get the byte array back.
> So, you won't be able to take advantage of reusing the ByteBuffer in the
> underlying responses.
>
> Thanks,
>
> Jun
>
> On Tue, Dec 2, 2014 at 5:26 PM, Rajiv Kurian  wrote:
>
> > I for one use the consumer (Simple Consumer) without any
> deserialization. I
> > just take the ByteBuffer wrap it a preallocated flyweight and use it
> > without creating any objects. I'd ideally not have to wrap this logic in
> a
> > deserializer interface. For every one who does do this, it seems like a
> > very small step.
> >
> > On Tue, Dec 2, 2014 at 5:12 PM, Joel Koshy  wrote:
> >
> > > > For (1), yes, but it's easier to make a config change than a code
> > change.
> > > > If you are using a third party library, one may not be able to make
> any
> > > > code change.
> > >
> > > Doesn't that assume that all organizations have to already share the
> > > same underlying specific data type definition (e.g.,
> > > UniversalAvroRecord). If not, then wouldn't they have to anyway make a
> > > code change anyway to use the shared definition (since that is
> > > required in the parameterized type of the producerrecord and
> > > producer)?  And if they have already made the change to use the said
> > > shared definition then you could just as well have the serializer of
> > > UniversalAvroRecord configured in your application config and have
> > > that replaced if you wish by some other implementation of a serializer
> > > of UniversalAvroRecord (again via config).
> > >
> > > > For (2), it's just that if most consumers always do deserialization
> > after
> > > > getting the raw bytes, perhaps it would be better to have these two
> > steps
> > > > integrated.
> > >
> > > True, but it is just a marginal and very obvious step that shouldn't
> > > surprise any user.
> > >
> > > Thanks,
> > >
> > > Joel
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Dec 2, 2014 at 2:05 PM, Joel Koshy 
> > wrote:
> > > >
> > > > > > The issue with a separate ser/deser library is that if it's not
> > part
> > > of
> > > > > the
> > > > > > client API, (1) users may not use it or (2) different users may
> use
> > > it in
> > > > > > different ways. For example, you can imagine that two Avro
> > > > > implementations
> > > > > > have different ways of instantiation (since it's not enforced by
> > the
> > > > > client
> > > > > > API). This makes sharing such kind of libraries harder.
> > > > >
> > > > > That is true - but that is also the point I think and it seems
> > > > > irrelevant to whether it is built-in to the producer's config or
> > > > > plugged in outside at the application-level. i.e., users will not
> use
> > > > > a common implementation if it does not fit their requirements. If a
> > > > > well-designed, full-featured and correctly implemented
> avro-or-other
> > > > > serializer/deserializer is made available there is no reason why
> that
> > > > > cannot be shared by different applications.
> > > > >
> > > > > > As for reason about the data types, take an example of the
> consumer
> > > > > > application. It needs to deal with objects at some point. So the
> > > earlier
> > > > > > that type information is revealed, the clearer it is to the
> > > application.
> > > > >
> > > > > Again for this, the only additional step is a call to deserialize.
> At
> > > > > some level the application _has_ to deal with the specific data
> type
> > > > > and it is thus reasonable to require that a consumed byte array
> needs
> > > > > to be deserialized to that type before being used.
> > > > >
> > > > > I suppose I don't see much benefit in pushing this into the core
> API
> > > > > of the producer at the expense of making these changes to the API.
> > At
> > > > > the same time, I should be clear that I don't think the proposal is
> > in
> > > > > any way unreasonable which is why I'm definitely not opposed to it,
> > > > > but I'm also not convinced that it is necessary.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Joel
> > > > >
> > > > > >
> > > > > > On Tue, Dec 2, 2014 at 10:06 AM, Joel Koshy  >
> > > wrote:
> > > > > >
> > > > > > > Re: pushing complexity of dealing with objects: we're talking
> > about
> > > > > > > just a call to a serialize method to convert the object to a
> byte
> > > > > > > array right? Or is there more to it? (To me) that seems less
> > > > > > > cumbersome than having to interact with parameterized types.
> > > Actually,
> > > > > > > can you explain more clearly what you mean by reason about
> > what
> > >

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-04 Thread Jiangjie Qin

Instead of binding serialization to the producer, another option is to
bind the serializer/deserializer to ProducerRecord/ConsumerRecord (please
see the detailed proposal below).

The arguments for this option are:
A. A single producer could send different message types. There are
several use cases at LinkedIn for a per-record serializer:
- In Samza, some in-stream, order-sensitive control messages use a
different deserializer from other messages.
- Some use cases need to send both Avro messages and raw bytes.
- Some use cases need to deserialize some Avro messages into a generic
record and other messages into a specific record.
B. In the current proposal, the serializer/deserializer is instantiated
according to config. Compared with that, binding the serializer to
ProducerRecord and ConsumerRecord is less error prone.


This option includes the following changes:
A. Add serializer and deserializer interfaces to replace serializer
instance from config.
public interface Serializer<K, V> {
    public byte[] serializeKey(K key);
    public byte[] serializeValue(V value);
}
public interface Deserializer<K, V> {
    public K deserializeKey(byte[] key);
    public V deserializeValue(byte[] value);
}

B. Make ProducerRecord and ConsumerRecord abstract classes implementing
Serializer<K, V> and Deserializer<K, V> respectively.
public abstract class ProducerRecord<K, V> implements Serializer<K, V>
{...}
public abstract class ConsumerRecord<K, V> implements Deserializer<K, V>
{...}

C. Instead of instantiating the serializer/deserializer from config, let
concrete ProducerRecord/ConsumerRecord classes extend the abstract class
and override the serialize/deserialize methods.

public class AvroProducerRecord extends ProducerRecord<String, GenericRecord> {
    ...
    @Override
    public byte[] serializeKey(String key) {...}
    @Override
    public byte[] serializeValue(GenericRecord value) {...}
}

public class AvroConsumerRecord extends ConsumerRecord<String, GenericRecord> {
    ...
    @Override
    public String deserializeKey(byte[] key) {...}
    @Override
    public GenericRecord deserializeValue(byte[] value) {...}
}

D. The producer API changes to
public class KafkaProducer {
    ...

    public <K, V> Future<RecordMetadata> send(ProducerRecord<K, V> record) {
        ...
        byte[] key = record.serializeKey(record.key);
        byte[] value = record.serializeValue(record.value);
        BytesProducerRecord bytesProducerRecord = new
            BytesProducerRecord(topic, partition, key, value);
        ...
    }
    ...
}
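
To make the per-record option concrete, here is a small self-contained sketch of the idea. All class names (SketchRecord, StringLongRecord, RecordBoundProducer) are hypothetical stand-ins, not Kafka classes, and the "producer" only returns the serialized bytes instead of sending them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record type that carries its own serialization logic.
abstract class SketchRecord<K, V> {
    final String topic;
    final K key;
    final V value;

    SketchRecord(String topic, K key, V value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }

    abstract byte[] serializeKey(K key);

    abstract byte[] serializeValue(V value);
}

// One concrete record per key/value type combination.
final class StringLongRecord extends SketchRecord<String, Long> {
    StringLongRecord(String topic, String key, Long value) {
        super(topic, key, value);
    }

    @Override
    byte[] serializeKey(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    byte[] serializeValue(Long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }
}

// Hypothetical producer: it asks the record to serialize itself, so a single
// producer instance can accept records of different types.
final class RecordBoundProducer {
    <K, V> byte[][] send(SketchRecord<K, V> record) {
        byte[] key = record.serializeKey(record.key);
        byte[] value = record.serializeValue(record.value);
        return new byte[][] {key, value}; // a real producer would enqueue these
    }
}
```

Note that each key/value type pairing needs its own record subclass in this design, which is the trade-off discussed later in the thread.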



We also had a brainstorming session at LinkedIn, and here is the feedback:

If the community decides to add serialization back to the new producer,
besides the current proposal, which changes the new producer API to be
templated, some other options were raised during our discussion:
1) Rather than changing the current new producer API, we can provide a
wrapper around it (e.g. KafkaSerializedProducer) and make it available to
users, since there is value in the simplicity of the current API.

2) If we decide to go with the templated new producer API, then based on
experience at LinkedIn it might be worth instantiating the serializer in
code instead of from config, so that we avoid runtime errors from dynamic
instantiation, which is more error prone. In that case, the producer API
could be changed to something like:
producer = new Producer<K, V>(KeySerializer<K>, ValueSerializer<V>)
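
The difference between the two instantiation styles in option 2 can be sketched as follows. This is an illustration under assumptions: the single-type Serializer interface, StringSerializer, and SerializerLoader here are hypothetical and simplified, and the config key used is just an example string.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Simplified, hypothetical serializer interface for this sketch.
interface Serializer<T> {
    byte[] serialize(T data);
}

final class StringSerializer implements Serializer<String> {
    @Override
    public byte[] serialize(String data) {
        return data.getBytes(StandardCharsets.UTF_8);
    }
}

final class SerializerLoader {
    // Config-driven instantiation: a typo in the class name, or a class of
    // the wrong kind, is only discovered when this runs.
    @SuppressWarnings("unchecked")
    static <T> Serializer<T> fromConfig(Properties props, String configKey) {
        String className = props.getProperty(configKey);
        if (className == null) {
            throw new IllegalArgumentException("Missing config: " + configKey);
        }
        try {
            return (Serializer<T>) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot instantiate serializer " + className, e);
        }
    }

    // Code-driven instantiation: the compiler checks that the class exists
    // and that its type parameter matches.
    static Serializer<String> inCode() {
        return new StringSerializer();
    }
}
```

With the code-driven form, passing a serializer of the wrong type to a typed producer would be a compile error rather than a failure at startup.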

--Jiangjie (Becket) Qin


On 11/24/14, 5:58 PM, "Jun Rao"  wrote:

>Hi, Everyone,
>
>I'd like to start a discussion on whether it makes sense to add the
>serializer api back to the new java producer. Currently, the new java
>producer takes a byte array for both the key and the value. While this api
>is simple, it pushes the serialization logic into the application. This
>makes it hard to reason about what type of data is being sent to Kafka and
>also makes it hard to share an implementation of the serializer. For
>example, to support Avro, the serialization logic could be quite involved
>since it might need to register the Avro schema in some remote registry
>and
>maintain a schema cache locally, etc. Without a serialization api, it's
>impossible to share such an implementation so that people can easily
>reuse.
>We sort of overlooked this implication duri

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-04 Thread Jay Kreps
I agree that having the new Producer<K, V>(KeySerializer<K>,
ValueSerializer<V>) interface would be useful.

People suggested cases where you want to mix and match serialization types.
The ByteArraySerializer is a no-op that would give the current behavior, so
any odd case where you need to mix and match serialization or opt out
entirely is totally possible and won't have any overhead other than the
syntactic burden of declaring the parametric type <byte[], byte[]>. However,
the expectation is that these cases are rare.

I really, really think we should avoid having a second producer interface
like KafkaSerializedProducer. KafkaProducer<byte[], byte[]> will give the
serialization-free behavior. I think our experience has been that surface
area really matters with these things, so let's not have two. That sounds
like a compromise but is actually the worst of all worlds, since it
duplicates everything over a fairly minor matter.
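
As a rough illustration of the opt-out described above, the sketch below shows an identity serializer plugged into a generic producer. SketchProducer is a hypothetical stand-in that records what it would send; the Serializer interface is simplified for this example and is not claimed to match the final Kafka interface.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical serializer interface for this sketch.
interface Serializer<T> {
    byte[] serialize(T data);
}

// Identity serializer: hands the caller's byte array through untouched.
final class ByteArraySerializer implements Serializer<byte[]> {
    @Override
    public byte[] serialize(byte[] data) {
        return data;
    }
}

// Hypothetical generic producer that takes serializer instances.
final class SketchProducer<K, V> {
    private final Serializer<K> keySerializer;
    private final Serializer<V> valueSerializer;
    final List<byte[][]> sent = new ArrayList<>();

    SketchProducer(Serializer<K> keySerializer, Serializer<V> valueSerializer) {
        this.keySerializer = keySerializer;
        this.valueSerializer = valueSerializer;
    }

    void send(K key, V value) {
        // A real producer would hand these bytes to the network layer.
        sent.add(new byte[][] {
            keySerializer.serialize(key),
            valueSerializer.serialize(value)
        });
    }
}
```

Declaring `SketchProducer<byte[], byte[]>` with two ByteArraySerializer instances behaves like a bytes-in/bytes-out producer; the only extra cost is the type declaration.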

-Jay



On Thu, Dec 4, 2014 at 10:33 AM, Jiangjie Qin 
wrote:

>
> I'm just thinking instead of binding serialization with producer, another
> option is to bind serializer/deserializer with
> ProducerRecord/ConsumerRecord (please see the detail proposal below.)
>The arguments for this option is:
> A. A single producer could send different message types. There are
> several use cases in LinkedIn for per record serializer
> - In Samza, there are some in-stream order-sensitive control
> messages
> having different deserializer from other messages.
> - There are use cases which need support for sending both Avro
> messages
> and raw bytes.
> - Some use cases needs to deserialize some Avro messages into
> generic
> record and some other messages into specific record.
> B. In current proposal, the serializer/deserilizer is instantiated
> according to config. Compared with that, binding serializer with
> ProducerRecord and ConsumerRecord is less error prone.
>
>
> This option includes the following changes:
> A. Add serializer and deserializer interfaces to replace serializer
> instance from config.
> Public interface Serializer  {
> public byte[] serializeKey(K key);
> public byte[] serializeValue(V value);
> }
> Public interface deserializer  {
> Public K deserializeKey(byte[] key);
> public V deserializeValue(byte[] value);
> }
>
> B. Make ProducerRecord and ConsumerRecord abstract class
> implementing
> Serializer  and Deserializer  respectively.
> Public abstract class ProducerRecord  implements
> Serializer 
> {...}
> Public abstract class ConsumerRecord  implements
> Deserializer  V> {...}
>
> C. Instead of instantiate the serializer/Deserializer from config,
> let
> concrete ProducerRecord/ConsumerRecord extends the abstract class and
> override the serialize/deserialize methods.
>
> Public class AvroProducerRecord extends ProducerRecord
>  GenericRecord> {
> ...
> @Override
> Public byte[] serializeKey(String key) {...}
> @Override
> public byte[] serializeValue(GenericRecord value);
> }
>
> Public class AvroConsumerRecord extends ConsumerRecord
>  GenericRecord> {
> ...
> @Override
> Public K deserializeKey(byte[] key) {...}
> @Override
> public V deserializeValue(byte[] value);
> }
>
> D. The producer API changes to
> Public class KafkaProducer {
> ...
>
> Future send (ProducerRecord 
> record) {
> ...
> K key = record.serializeKey(record.key);
> V value =
> record.serializedValue(record.value);
> BytesProducerRecord bytesProducerRecord =
> new
> BytesProducerRecord(topic, partition, key, value);
> ...
> }
> ...
> }
>
>
>
> We also had some brainstorm in LinkedIn and here are the feedbacks:
>
> If the community decide to add the serialization back to new producer,
> besides current proposal which changes new producer API to be a template,
> there are some other options raised during our discussion:
> 1) Rather than change current new producer API, we can provide a
> wrapper
> of current new producer (e.g. KafkaSerializedProducer) and make it
> available to users. As there is value in the simplicity of current API.
>
> 2) If we decide to go with tempalated new producer API, according
> to
> experience in LinkedIn, it might worth considering to instantiate 

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-08 Thread Jun Rao
OK, based on all the feedback that we have heard, I plan to do the
following.

1. Keep the generic api in KAFKA-1797.
2. Add a new constructor in Producer/Consumer that takes the key and the
value serializer instance.
3. Have KAFKA-1797 reviewed and checked into 0.8.2 and trunk.

This will make it easy for people to reuse common serializers while at the
same time allowing people to use the byte array api if they choose to do so.

I plan to make those changes in the next couple of days unless someone
strongly objects.

Thanks,

Jun


On Fri, Dec 5, 2014 at 5:46 PM, Jiangjie Qin 
wrote:

> Hi Jun,
>
> Thanks for pointing this out. Yes, putting serialization/deserialization
> code into the record does lose some flexibility. On further thought, I think
> no matter what we do to bind the producer and serializer/deserializer, we
> can always do the same thing on Record, i.e. we can also have some
> constructor like ProducerRecor, Deserializer>. The
> downside of this is that we could potentially have a
> serializer/deserializer instance for each record (that's actually the very
> reason that I propose to put the code in record). This problem could be
> addressed by either using a singleton class or factory for
> serializer/deserializer library. But it might be a little bit complicated
> and we are not able to enforce that on external libraries either. So it
> seems to make sense only if we really want to:
> 1. Have a single simple producer interface.
> AND
> 2. Use a single producer to send all types of messages.
>
> I'm not sure these requirements are strong enough to make us take on the
> complexity of a singleton/factory class serializer/deserializer library.
>
> Thanks.
>
> Jiangjie (Becket) Qin
>
> On 12/5/14, 3:16 PM, "Jun Rao"  wrote:
>
> >Jiangjie,
> >
> >The issue with adding the serializer in ProducerRecord is that you need to
> >implement all combinations of serializers for key and value. So, instead
> >of
> >just implementing int and string serializers, you will have to implement
> >all 4 combinations.
> >
> >Adding a new producer constructor like Producer(KeySerializer,
> >ValueSerializer, Properties properties) can be useful.
> >
> >Thanks,
> >
> >Jun
> >
> >On Thu, Dec 4, 2014 at 10:33 AM, Jiangjie Qin 
> >wrote:
> >
> >>
> >> I'm just thinking instead of binding serialization with producer,
> >>another
> >> option is to bind serializer/deserializer with
> >> ProducerRecord/ConsumerRecord (please see the detail proposal below.)
> >>The arguments for this option is:
> >> A. A single producer could send different message types. There
> >>are
> >> several use cases in LinkedIn for per record serializer
> >> - In Samza, there are some in-stream order-sensitive control
> >> messages
> >> having different deserializer from other messages.
> >> - There are use cases which need support for sending both Avro
> >> messages
> >> and raw bytes.
> >> - Some use cases needs to deserialize some Avro messages into
> >> generic
> >> record and some other messages into specific record.
> >> B. In current proposal, the serializer/deserilizer is
> >>instantiated
> >> according to config. Compared with that, binding serializer with
> >> ProducerRecord and ConsumerRecord is less error prone.
> >>
> >>
> >> This option includes the following changes:
> >> A. Add serializer and deserializer interfaces to replace
> >>serializer
> >> instance from config.
> >> Public interface Serializer  {
> >> public byte[] serializeKey(K key);
> >> public byte[] serializeValue(V value);
> >> }
> >> Public interface deserializer  {
> >> Public K deserializeKey(byte[] key);
> >> public V deserializeValue(byte[] value);
> >> }
> >>
> >> B. Make ProducerRecord and ConsumerRecord abstract class
> >> implementing
> >> Serializer  and Deserializer  respectively.
> >> Public abstract class ProducerRecord  implements
> >> Serializer 
> >> {...}
> >> Public abstract class ConsumerRecord  implements
> >> Deserializer  >> V> {...}
> >>
> >> C. Instead of instantiate the serializer/Deserializer from
> >>config,
> >> let
> >> concrete ProducerRecord/ConsumerRecord extends the abstract class and
> >> override the serialize/deserialize methods.
> >>
> >> Public class AvroProducerRecord extends ProducerRecord
> >>  >> GenericRecord> {
> >> ...
> >> @Override
> >> Public byte[] serializeKey(String key) {...}
> >> @Override
> >> public byte[] serializeValue(GenericRecord
> >>value);
> >> }
> >>
> >> Public class AvroConsumerRecord extends ConsumerRecord
> >>  >> GenericRecord> {
> >> ...
> >> 

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-08 Thread Sriram Subramanian
Thank you Jay. I agree with the issue that you point out w.r.t. paired
serializers. I also think mixing serialization types is rare. To get
the current behavior, one can simply use a ByteArraySerializer. This is
best understood by talking with many customers and you seem to have done
that. I am convinced about the change.

For the rest who gave -1 or 0 for this proposal, do the answers to the
three points (updated) below seem reasonable? Are these explanations
convincing?


1. Can we keep the serialization semantics outside the Producer interface
and have simple bytes in / bytes out for the interface (This is what we
have today).

The point for this is that it keeps the interface simple and usage easy to
understand. The point against this is that it gets hard to share common
usage patterns around serialization/message validations for the future.

2. Can we create a wrapper producer that does the serialization and have
different variants of it for different data formats?

The point for this is again that it keeps the main API clean. The point
against this is that it duplicates the API, increases the surface area and
creates redundancy for a minor addition.

3. Do we need to support different data types per record? The current
interface (bytes in/bytes out) lets you instantiate one producer and use
it to send multiple data formats. There seem to be some valid use cases
for this.


Mixed serialization types are rare based on interactions with customers.
To get the current behavior, one can simply use a ByteArraySerializer.
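
One way to picture the rare mixed case is a serializer that prefixes each payload with a marker byte, as suggested in the reply quoted below, so that the consumer knows how to read it. The sketch is illustrative only: TaggedSerializer and its marker values are hypothetical, and the Serializer interface is simplified.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified, hypothetical serializer interface for this sketch.
interface Serializer<T> {
    byte[] serialize(T data);
}

// Hypothetical serializer for a stream mixing raw bytes and strings.
// The first byte of every payload is a marker identifying the format.
final class TaggedSerializer implements Serializer<Object> {
    static final byte RAW = 0;
    static final byte TEXT = 1;

    @Override
    public byte[] serialize(Object data) {
        byte tag;
        byte[] body;
        if (data instanceof byte[]) {
            tag = RAW;
            body = (byte[]) data;
        } else if (data instanceof String) {
            tag = TEXT;
            body = ((String) data).getBytes(StandardCharsets.UTF_8);
        } else {
            throw new IllegalArgumentException("Unsupported type: "
                    + (data == null ? "null" : data.getClass().getName()));
        }
        byte[] out = new byte[body.length + 1];
        out[0] = tag;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    // Consumer side: the marker removes the need to guess the format.
    static Object deserialize(byte[] bytes) {
        byte[] body = Arrays.copyOfRange(bytes, 1, bytes.length);
        switch (bytes[0]) {
            case RAW:
                return body;
            case TEXT:
                return new String(body, StandardCharsets.UTF_8);
            default:
                throw new IllegalArgumentException("Unknown marker: " + bytes[0]);
        }
    }
}
```

Seen this way, the mixed stream is handled by one serializer type, and a single typed producer still suffices.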

On 12/5/14 5:00 PM, "Jay Kreps"  wrote:

>Hey Sriram,
>
>Thanks! I think this is a very helpful summary.
>
>Let me try to address your point about passing in the serde at send time.
>
>I think the first objection is really to the paired key/value serializer
>interfaces. This leads to kind of a weird combinatorial thing where you
>would have an avro/avro serializer a string/avro serializer, a pb/pb
>serializer, and a string/pb serializer, and so on. But your proposal would
>work as well with separate serializers for key and value.
>
>I think the downside is just the one you call out--that this is a corner
>case and you end up with two versions of all the apis to support it. This
>also makes the serializer api more annoying to implement. I think the
>alternative solution to this case and any other we can give people is just
>configuring ByteArraySerializer which gives you basically the api that you
>have now with byte arrays. If this is incredibly common then this would be
>a silly solution, but I guess the belief is that these cases are rare and
>a
>really well implemented avro or json serializer should be 100% of what
>most
>people need.
>
>In practice the cases that actually mix serialization types in a single
>stream are pretty rare I think just because the consumer then has the
>problem of guessing how to deserialize, so most of these will end up with
>at least some marker or schema id or whatever that tells you how to read
>the data. Arguable this mixed serialization with marker is itself a
>serializer type and should have a serializer of its own...
>
>-Jay
>
>On Fri, Dec 5, 2014 at 3:48 PM, Sriram Subramanian <
>srsubraman...@linkedin.com.invalid> wrote:
>
>> This thread has diverged multiple times now and it would be worth
>> summarizing them.
>>
>> There seems to be the following points of discussion -
>>
>> 1. Can we keep the serialization semantics outside the Producer
>>interface
>> and have simple bytes in / bytes out for the interface (This is what we
>> have today).
>>
>> The points for this is to keep the interface simple and usage easy to
>> understand. The points against this is that it gets hard to share common
>> usage patterns around serialization/message validations for the future.
>>
>> 2. Can we create a wrapper producer that does the serialization and have
>> different variants of it for different data formats?
>>
>> The points for this is again to keep the main API clean. The points
>> against this is that it duplicates the API, increases the surface area
>>and
>> creates redundancy for a minor addition.
>>
>> 3. Do we need to support different data types per record? The current
>> interface (bytes in/bytes out) lets you instantiate one producer and use
>> it to send multiple data formats. There seems to be some valid use cases
>> for this.
>>
>> I have still not seen a strong argument against not having this
>> functionality. Can someone provide their views on why we don't need this
>> support that is possible with the current API?
>>
>> One possible approach for the per record serialization would be to
>>define
>>
> >> public interface SerDe<K, V> {
>>   public byte[] serializeKey();
>>
>>   public K deserializeKey();
>>
>>   public byte[] serializeValue();
>>
>>   public V deserializeValue();
>> }
>>
>> This would be used by both the Producer and the Consumer.
>>
>> The send APIs can then be
>>
>> public Future send(ProducerRecord record);
>> public Future send(ProducerRecord

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-15 Thread Joel Koshy
(sorry about the late follow-up - I'm traveling most of this
month)

I'm likely missing something obvious, but I find the following to be a
somewhat vague point that has been mentioned more than once in this
thread without a clear explanation. i.e., why is it hard to share a
serializer/deserializer implementation and just have the clients call
it before a send/receive? What "usage pattern" cannot be supported by
the simpler API?

> 1. Can we keep the serialization semantics outside the Producer interface
> and have simple bytes in / bytes out for the interface (This is what we
> have today).
> 
> The points for this is to keep the interface simple and usage easy to
> understand. The points against this is that it gets hard to share common
> usage patterns around serialization/message validations for the future.
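
For concreteness, the simpler arrangement being asked about could look roughly like the sketch below: a shared serialization library that the application calls itself before handing bytes to a bytes-only producer. BytesOnlyProducer and LongSerde are hypothetical names for this illustration, and the producer only records what it was given.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bytes-in/bytes-out producer standing in for the simpler API.
final class BytesOnlyProducer {
    final List<byte[]> sentValues = new ArrayList<>();

    void send(String topic, byte[] key, byte[] value) {
        sentValues.add(value); // a real producer would send to the broker
    }
}

// Hypothetical shared library that applications call explicitly.
final class LongSerde {
    static byte[] serialize(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long deserialize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }
}

class AppLevelSerializationDemo {
    static BytesOnlyProducer run() {
        BytesOnlyProducer producer = new BytesOnlyProducer();
        // The application is responsible for calling the shared library;
        // nothing in the producer's types requires or checks this step.
        producer.send("events", null, LongSerde.serialize(1234L));
        return producer;
    }
}
```

The sketch works, but the producer's signature gives no hint which serializer, if any, was used, which is the point debated in the replies.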


On Tue, Dec 09, 2014 at 03:51:08AM +, Sriram Subramanian wrote:
> Thank you Jay. I agree with the issue that you point w.r.t paired
> serializers. I also think having mix serialization types is rare. To get
> the current behavior, one can simply use a ByteArraySerializer. This is
> best understood by talking with many customers and you seem to have done
> that. I am convinced about the change.
> 
> For the rest who gave -1 or 0 for this proposal, does the answers for the
> three points(updated) below seem reasonable? Are these explanations
> convincing? 
> 
> 
> 1. Can we keep the serialization semantics outside the Producer interface
> and have simple bytes in / bytes out for the interface (This is what we
> have today).
> 
> The points for this is to keep the interface simple and usage easy to
> understand. The points against this is that it gets hard to share common
> usage patterns around serialization/message validations for the future.
> 
> 2. Can we create a wrapper producer that does the serialization and have
> different variants of it for different data formats?
> 
> The points for this is again to keep the main API clean. The points
> against this is that it duplicates the API, increases the surface area and
> creates redundancy for a minor addition.
> 
> 3. Do we need to support different data types per record? The current
> interface (bytes in/bytes out) lets you instantiate one producer and use
> it to send multiple data formats. There seems to be some valid use cases
> for this.
> 
> 
> Mixed serialization types are rare based on interactions with customers.
> To get the current behavior, one can simply use a ByteArraySerializer.
> 
> On 12/5/14 5:00 PM, "Jay Kreps"  wrote:
> 
> >Hey Sriram,
> >
> >Thanks! I think this is a very helpful summary.
> >
> >Let me try to address your point about passing in the serde at send time.
> >
> >I think the first objection is really to the paired key/value serializer
> >interfaces. This leads to kind of a weird combinatorial thing where you
> >would have an avro/avro serializer a string/avro serializer, a pb/pb
> >serializer, and a string/pb serializer, and so on. But your proposal would
> >work as well with separate serializers for key and value.
> >
> >I think the downside is just the one you call out--that this is a corner
> >case and you end up with two versions of all the apis to support it. This
> >also makes the serializer api more annoying to implement. I think the
> >alternative solution to this case and any other we can give people is just
> >configuring ByteArraySerializer which gives you basically the api that you
> >have now with byte arrays. If this is incredibly common then this would be
> >a silly solution, but I guess the belief is that these cases are rare and
> >a
> >really well implemented avro or json serializer should be 100% of what
> >most
> >people need.
> >
> >In practice the cases that actually mix serialization types in a single
> >stream are pretty rare I think just because the consumer then has the
> >problem of guessing how to deserialize, so most of these will end up with
> >at least some marker or schema id or whatever that tells you how to read
> >the data. Arguable this mixed serialization with marker is itself a
> >serializer type and should have a serializer of its own...
> >
> >-Jay
> >
> >On Fri, Dec 5, 2014 at 3:48 PM, Sriram Subramanian <
> >srsubraman...@linkedin.com.invalid> wrote:
> >
> >> This thread has diverged multiple times now and it would be worth
> >> summarizing them.
> >>
> >> There seems to be the following points of discussion -
> >>
> >> 1. Can we keep the serialization semantics outside the Producer
> >>interface
> >> and have simple bytes in / bytes out for the interface (This is what we
> >> have today).
> >>
> >> The points for this is to keep the interface simple and usage easy to
> >> understand. The points against this is that it gets hard to share common
> >> usage patterns around serialization/message validations for the future.
> >>
> >> 2. Can we create a wrapper producer that does the serialization and have
> >> different variants of it for different data formats?
> >>

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-15 Thread Jun Rao
Joel,

It's just that if the serializer/deserializer is not part of the API, you
can only encourage people to use it through documentation. However, not
everyone will read the documentation if it's not directly used in the API.

Thanks,

Jun

On Mon, Dec 15, 2014 at 2:11 AM, Joel Koshy  wrote:

> (sorry about the late follow-up late - I'm traveling most of this
> month)
>
> I'm likely missing something obvious, but I find the following to be a
> somewhat vague point that has been mentioned more than once in this
> thread without a clear explanation. i.e., why is it hard to share a
> serializer/deserializer implementation and just have the clients call
> it before a send/receive? What "usage pattern" cannot be supported by
> the simpler API?
>
> > 1. Can we keep the serialization semantics outside the Producer interface
> > and have simple bytes in / bytes out for the interface (This is what we
> > have today).
> >
> > The points for this is to keep the interface simple and usage easy to
> > understand. The points against this is that it gets hard to share common
> > usage patterns around serialization/message validations for the future.
>
>
> On Tue, Dec 09, 2014 at 03:51:08AM +, Sriram Subramanian wrote:
> > Thank you Jay. I agree with the issue that you point w.r.t paired
> > serializers. I also think having mix serialization types is rare. To get
> > the current behavior, one can simply use a ByteArraySerializer. This is
> > best understood by talking with many customers and you seem to have done
> > that. I am convinced about the change.
> >
> > For the rest who gave -1 or 0 for this proposal, does the answers for the
> > three points(updated) below seem reasonable? Are these explanations
> > convincing?
> >
> >
> > 1. Can we keep the serialization semantics outside the Producer interface
> > and have simple bytes in / bytes out for the interface (This is what we
> > have today).
> >
> > The points for this are to keep the interface simple and usage easy to
> > understand. The points against this are that it gets hard to share common
> > usage patterns around serialization/message validations for the future.
> >
> > 2. Can we create a wrapper producer that does the serialization and have
> > different variants of it for different data formats?
> >
> > The points for this are again to keep the main API clean. The points
> > against this are that it duplicates the API, increases the surface area
> > and creates redundancy for a minor addition.
> >
> > 3. Do we need to support different data types per record? The current
> > interface (bytes in/bytes out) lets you instantiate one producer and use
> > it to send multiple data formats. There seems to be some valid use cases
> > for this.
> >
> >
> > Mixed serialization types are rare based on interactions with customers.
> > To get the current behavior, one can simply use a ByteArraySerializer.
> >
> > On 12/5/14 5:00 PM, "Jay Kreps"  wrote:
> >
> > >Hey Sriram,
> > >
> > >Thanks! I think this is a very helpful summary.
> > >
> > >Let me try to address your point about passing in the serde at send
> > >time.
> > >
> > >I think the first objection is really to the paired key/value serializer
> > >interfaces. This leads to kind of a weird combinatorial thing where you
> > >would have an avro/avro serializer, a string/avro serializer, a pb/pb
> > >serializer, and a string/pb serializer, and so on. But your proposal would
> > >work as well with separate serializers for key and value.
> > >
> > >I think the downside is just the one you call out--that this is a corner
> > >case and you end up with two versions of all the apis to support it. This
> > >also makes the serializer api more annoying to implement. I think the
> > >alternative solution to this case and any other we can give people is just
> > >configuring ByteArraySerializer which gives you basically the api that you
> > >have now with byte arrays. If this is incredibly common then this would be
> > >a silly solution, but I guess the belief is that these cases are rare and
> > >a really well implemented avro or json serializer should be 100% of what
> > >most people need.
> > >
> > >In practice the cases that actually mix serialization types in a single
> > >stream are pretty rare I think just because the consumer then has the
> > >problem of guessing how to deserialize, so most of these will end up with
> > >at least some marker or schema id or whatever that tells you how to read
> > >the data. Arguably this mixed serialization with marker is itself a
> > >serializer type and should have a serializer of its own...
> > >
> > >-Jay
> > >
> > >On Fri, Dec 5, 2014 at 3:48 PM, Sriram Subramanian <
> > >srsubraman...@linkedin.com.invalid> wrote:
> > >
> > >> This thread has diverged multiple times now and it would be worth
> > >> summarizing them.
> > >>
> > >> There seems to be the following points of discussion -
> > >>
> > >> 1. Can we keep the serialization semantics outside the Prod

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-15 Thread Joel Koshy
Documentation is inevitable even if the serializer/deserializer is
part of the API - since the user has to set it up in the configs. So
again, you can only encourage people to use it through documentation.
The simpler byte-oriented API seems clearer to me because anyone who
needs to send (or receive) a specific data type will _be forced to_
(or actually, _intuitively_) select a serializer (or deserializer) and
will definitely pick an already available implementation if a good one
already exists.

Sorry I still don't get it and this is really the only sticking point
for me, albeit a minor one (which is why I have been +0 all along on
the change). I (and I think many others) would appreciate it if
someone can help me understand this better.  So I will repeat the
question: What "usage pattern" cannot be supported easily by the
simpler API without adding burden on the user?

Thanks,

Joel

On Mon, Dec 15, 2014 at 11:59:34AM -0800, Jun Rao wrote:
> Joel,
> 
> It's just that if the serializer/deserializer is not part of the API, you
> can only encourage people to use it through documentation. However, not
> everyone will read the documentation if it's not directly used in the API.
> 
> Thanks,
> 
> Jun

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-16 Thread Jun Rao
Joel,

With a byte array interface, of course there is nothing that one can't do.
However, the real question is whether we want to encourage people to
use it this way or not. Being able to flow just bytes is definitely easier
to get started. That's why many early adopters choose to do it that way.
However, it's often the case that they start feeling the pain later when
some producers change the data format. Their Hive/Pig queries start to
break and it's a painful process to have the issue fixed. So, the purpose
of this api change is really to encourage people to standardize on a single
serializer/deserializer that supports things like data validation and
schema evolution upstream in the producer. Now, suppose there is an Avro
serializer/deserializer implementation. How do we make it easy for people
to adopt? If the serializer is part of the api, we can just say, wire in
the Avro serializer for key and/or value in the config and then you can
start sending Avro records to the producer. If the serializer is not part
of the api, we have to say, first instantiate a key and/or value serializer
this way, send the key to the key serializer to get the key bytes, send the
value to the value serializer to get the value bytes, and finally send the
bytes to the producer. The former is simpler and will likely make
adoption easier.
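
To make the two call-site shapes described above concrete, here is a minimal sketch. Everything in it is a hypothetical stand-in written for illustration: Serializer, StringSerializer, BytesProducer and TypedProducer are not the Kafka classes, and TypedProducer only imitates what calling a producer with serializers wired in might look like. It assumes UTF-8 string keys and values.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerializerSketch {
    /** Hypothetical serializer contract: turn a typed object into bytes. */
    interface Serializer<T> {
        byte[] serialize(String topic, T data);
    }

    /** Hypothetical UTF-8 string serializer. */
    static class StringSerializer implements Serializer<String> {
        @Override
        public byte[] serialize(String topic, String data) {
            return data.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Stand-in for a byte-oriented producer; it only records what was "sent". */
    static class BytesProducer {
        final List<byte[][]> sent = new ArrayList<>();

        void send(String topic, byte[] key, byte[] value) {
            sent.add(new byte[][] {key, value});
        }
    }

    /** Stand-in for a producer whose serializers are wired in once, up front. */
    static class TypedProducer<K, V> {
        private final BytesProducer delegate;
        private final Serializer<K> keySerializer;
        private final Serializer<V> valueSerializer;

        TypedProducer(BytesProducer delegate, Serializer<K> keySerializer,
                      Serializer<V> valueSerializer) {
            this.delegate = delegate;
            this.keySerializer = keySerializer;
            this.valueSerializer = valueSerializer;
        }

        void send(String topic, K key, V value) {
            delegate.send(topic,
                    keySerializer.serialize(topic, key),
                    valueSerializer.serialize(topic, value));
        }
    }

    public static void main(String[] args) {
        // Style 1: serializer outside the API. The caller serializes
        // the key and the value explicitly before every send.
        BytesProducer raw = new BytesProducer();
        Serializer<String> s = new StringSerializer();
        raw.send("events", s.serialize("events", "user-1"), s.serialize("events", "clicked"));

        // Style 2: serializers chosen once; the caller sends typed objects.
        BytesProducer underlying = new BytesProducer();
        TypedProducer<String, String> typed =
                new TypedProducer<>(underlying, new StringSerializer(), new StringSerializer());
        typed.send("events", "user-1", "clicked");

        // Both styles put the same bytes on the wire.
        System.out.println(Arrays.equals(raw.sent.get(0)[1], underlying.sent.get(0)[1])); // prints true
    }
}
```

Both styles produce identical bytes; the difference under discussion is only where the serialize calls live and who is responsible for making them.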

Thanks,

Jun

On Mon, Dec 15, 2014 at 7:20 PM, Joel Koshy  wrote:
>
> Documentation is inevitable even if the serializer/deserializer is
> part of the API - since the user has to set it up in the configs. So
> again, you can only encourage people to use it through documentation.
> The simpler byte-oriented API seems clearer to me because anyone who
> needs to send (or receive) a specific data type will _be forced to_
> (or actually, _intuitively_) select a serializer (or deserializer) and
> will definitely pick an already available implementation if a good one
> already exists.
>
> Sorry I still don't get it and this is really the only sticking point
> for me, albeit a minor one (which is why I have been +0 all along on
> the change). I (and I think many others) would appreciate it if
> someone can help me understand this better.  So I will repeat the
> question: What "usage pattern" cannot be supported easily by the
> simpler API without adding burden on the user?
>
> Thanks,
>
> Joel

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-16 Thread Joel Koshy
Jun,

Thanks for summarizing this - it helps confirm for me that I did not
misunderstand anything in this thread so far; and that I disagree with
the premise that the steps in using the current byte-oriented API are
cumbersome or inflexible. It involves instantiating the K-V
serializers in code (as opposed to config) and an extra (but explicit
- i.e., making it very clear to the user) but simple call to serialize
before sending.

Downstream queries can break just as well with the implicit
serializers/deserializers - since ultimately people have to
instantiate the specific type in their code and if they want to send
it they will.

I think adoption is also equivalent since people will just instantiate
whatever serializer/deserializer they want in one line. Plugging in a
new serializer implementation does require a code change, but that can
also be avoided via a config driven factory.
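
A config driven factory of the kind mentioned above could look roughly like the following sketch. It is only an illustration under stated assumptions: the Serializer interface, the Utf8StringSerializer class and the "value.serializer.class" key are hypothetical names invented here, not anything from the producer's API or configs. The serializer class is looked up by name via reflection and must have a public no-argument constructor.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class SerializerFactorySketch {
    /** Hypothetical serializer contract used only for this sketch. */
    public interface Serializer<T> {
        byte[] serialize(T data);
    }

    /** Hypothetical implementation that the config below points at. */
    public static class Utf8StringSerializer implements Serializer<String> {
        @Override
        public byte[] serialize(String data) {
            return data.getBytes(StandardCharsets.UTF_8);
        }
    }

    /**
     * Instantiates the serializer class named under the given config key.
     * The cast is unchecked: the caller is trusted to ask for the right type.
     */
    @SuppressWarnings("unchecked")
    static <T> Serializer<T> fromConfig(Properties props, String key)
            throws ReflectiveOperationException {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("missing config: " + key);
        }
        Class<?> cls = Class.forName(className);
        return (Serializer<T>) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // "value.serializer.class" is a made-up key for this example.
        Properties props = new Properties();
        props.setProperty("value.serializer.class", Utf8StringSerializer.class.getName());

        // Swapping implementations now means editing the config, not the code.
        Serializer<String> serializer = fromConfig(props, "value.serializer.class");
        byte[] bytes = serializer.serialize("hello");
        System.out.println(bytes.length); // prints 5
    }
}
```

With this in place the application still makes the explicit serialize call before sending, but the choice of implementation moves out of the code.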

So I'm still +0 on the change but I'm definitely not against moving
forward with the changes. i.e., unless there is any strong -1 on the
proposal from anyone else.

Thanks,

Joel

> With a byte array interface, of course there is nothing that one can't do.
> However, the real question is whether we want to encourage people to
> use it this way or not. Being able to flow just bytes is definitely easier
> to get started. That's why many early adopters choose to do it that way.
> However, it's often the case that they start feeling the pain later when
> some producers change the data format. Their Hive/Pig queries start to
> break and it's a painful process to have the issue fixed. So, the purpose
> of this api change is really to encourage people to standardize on a single
> serializer/deserializer that supports things like data validation and
> schema evolution upstream in the producer. Now, suppose there is an Avro
> serializer/deserializer implementation. How do we make it easy for people
> to adopt? If the serializer is part of the api, we can just say, wire in
> the Avro serializer for key and/or value in the config and then you can
> start sending Avro records to the producer. If the serializer is not part
> of the api, we have to say, first instantiate a key and/or value serializer
> this way, send the key to the key serializer to get the key bytes, send the
> value to the value serializer to get the value bytes, and finally send the
> bytes to the producer. The former is simpler and will likely make
> adoption easier.
> 
> Thanks,
> 
> Jun

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-17 Thread Jun Rao
Thanks everyone for the feedback and the discussion. The proposed changes
have been checked into both 0.8.2 and trunk.

Jun

On Tue, Dec 16, 2014 at 10:43 PM, Joel Koshy  wrote:
>
> Jun,
>
> Thanks for summarizing this - it helps confirm for me that I did not
> misunderstand anything in this thread so far; and that I disagree with
> the premise that the steps in using the current byte-oriented API are
> cumbersome or inflexible. It involves instantiating the K-V
> serializers in code (as opposed to config) and an extra (but explicit
> - i.e., making it very clear to the user) but simple call to serialize
> before sending.
>
> Downstream queries can break just as well with the implicit
> serializers/deserializers - since ultimately people have to
> instantiate the specific type in their code and if they want to send
> it they will.
>
> I think adoption is also equivalent since people will just instantiate
> whatever serializer/deserializer they want in one line. Plugging in a
> new serializer implementation does require a code change, but that can
> also be avoided via a config driven factory.
>
> So I'm still +0 on the change but I'm definitely not against moving
> forward with the changes. i.e., unless there is any strong -1 on the
> proposal from anyone else.
>
> Thanks,
>
> Joel

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-17 Thread Rajiv Kurian
Has the mvn repo been updated too?

On Wed, Dec 17, 2014 at 4:31 PM, Jun Rao  wrote:
>
> Thanks everyone for the feedback and the discussion. The proposed changes
> have been checked into both 0.8.2 and trunk.
>
> Jun

Re: [DISCUSSION] adding the serializer api back to the new java producer

2014-12-17 Thread Jun Rao
Not yet. It will be when 0.8.2 is released.

Thanks,

Jun

On Wed, Dec 17, 2014 at 5:24 PM, Rajiv Kurian  wrote:
>
> Has the mvn repo been updated too?