Re: Protocol documentation draft

2012-12-04 Thread Jun Rao
Thanks for drafting this. Made some minor edits and added some comments.

Thanks,

Jun



Re: Protocol documentation draft

2012-11-30 Thread Jay Kreps
Not quite. I noticed a few bugs that I pointed out in the previous email, and
I would like to correct these today before we officially freeze it. So maybe
give me through Monday? At that point it would be absolutely fantastic to
have someone try to implement something working from the document rather
than from our code. Since that is likely to be a frustrating journey, the
best plan might be to keep me around on IRC while you are working so I can
improve the docs and quickly fix anything that isn't right.

-Jay


On Fri, Nov 30, 2012 at 8:16 AM, David Arthur  wrote:

> Is the 0.8 protocol pretty much frozen at this point? If so, I'd like to
> start updating my python client.
>
> Also, I'm guessing 0.8 will be backwards compatible with 0.7 (due to the
> API versioning)
>
> -David
>


Re: Protocol documentation draft

2012-11-30 Thread David Arthur
Is the 0.8 protocol pretty much frozen at this point? If so, I'd like to start 
updating my python client.

Also, I'm guessing 0.8 will be backwards compatible with 0.7 (due to the API 
versioning)

-David




Re: Protocol documentation draft

2012-11-29 Thread Jay Kreps
Okay here is a proposal to address the issues I raised.

1, 2. Correlation id. Strictly speaking this is not needed, but it may be
useful for debugging to be able to trace a particular request from client
to server. So we will extend this across all the requests.
3. For the metadata response I will try to fix this up by normalizing out the
broker list and having the isr, replicas, and leader fields just hold the
node ids.
4. This should be uncontroversial and easy to add.
5. Let's remove creator id; it isn't used.
6. Let's generalize the offset request. My proposal is below:

Rename TopicMetadata API to ClusterMetadata, as this will contain all the
data that is known cluster-wide. Then let's generalize the offset request
to be PartitionMetadata--namely stuff about a particular partition on a
particular server.

The format of PartitionMetadata would be the following:

PartitionMetadataRequest => [TopicName [PartitionId MinSegmentTime MaxSegmentInfos]]
  TopicName => string
  PartitionId => uint32
  MinSegmentTime => uint64
  MaxSegmentInfos => int32

PartitionMetadataResponse => [TopicName [PartitionMetadata]]
  TopicName => string
  PartitionMetadata => PartitionId LogSize NumberOfSegments LogEndOffset HighwaterMark [SegmentData]
  SegmentData => StartOffset LastModifiedTime
  LogSize => uint64
  NumberOfSegments => int32
  LogEndOffset => int64
  HighwaterMark => int64
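
To make the request grammar concrete, here is a rough Python sketch of how a
client might serialize a PartitionMetadataRequest body. Several things here
are assumptions, not part of the proposal: big-endian byte order, an int32
element count in front of each [array], an int16 byte length in front of each
string, and all of the function names. Request headers and framing are left
out.

```python
import struct


def encode_string(s):
    # Assumption: int16 byte length, then the UTF-8 bytes.
    data = s.encode("utf-8")
    return struct.pack(">h", len(data)) + data


def encode_array(items, encode_item):
    # Assumption: int32 element count, then each encoded element in order.
    return struct.pack(">i", len(items)) + b"".join(encode_item(i) for i in items)


def encode_partition(partition):
    # PartitionId => uint32, MinSegmentTime => uint64, MaxSegmentInfos => int32
    partition_id, min_segment_time, max_segment_infos = partition
    return struct.pack(">IQi", partition_id, min_segment_time, max_segment_infos)


def encode_partition_metadata_request(topics):
    # topics: [(topic_name, [(partition_id, min_segment_time, max_segment_infos), ...]), ...]
    def encode_topic(topic):
        name, partitions = topic
        return encode_string(name) + encode_array(partitions, encode_partition)

    return encode_array(topics, encode_topic)


# One topic ("events", a made-up name) with a single partition entry.
payload = encode_partition_metadata_request([("events", [(0, 0, 10)])])
```

The nested [ ] in the grammar maps directly onto nested encode_array calls,
which is the main point of the sketch.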

This would be general enough that we could continue to add to it for any
new pieces of data we need.
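
For item 3, the normalized metadata response could look roughly like the
following once decoded on the client. This is only an illustration in Python:
the dictionary layout, field names, host names, and the resolve helper are
all hypothetical, and the wire format is not shown.

```python
# Hypothetical decoded form: broker details appear once, keyed by node id.
metadata = {
    "brokers": {
        1: {"host": "broker1.example.com", "port": 9092},
        2: {"host": "broker2.example.com", "port": 9092},
    },
    "topics": {
        "events": {
            # Per partition, only node ids are stored for leader, replicas, isr.
            0: {"leader": 1, "replicas": [1, 2], "isr": [1]},
        },
    },
}


def resolve(metadata, topic, partition):
    """Look up full broker details for one partition from the shared broker list."""
    info = metadata["topics"][topic][partition]
    brokers = metadata["brokers"]
    return {
        # The leader may be missing, so use .get and allow None.
        "leader": brokers.get(info["leader"]),
        "replicas": [brokers[node_id] for node_id in info["replicas"]],
        "isr": [brokers[node_id] for node_id in info["isr"]],
    }


resolved = resolve(metadata, "events", 0)
```

Each broker's host and port are stored once, no matter how many partitions
reference that node id.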

-Jay




Protocol documentation draft

2012-11-29 Thread Jay Kreps
I started trying to document the 0.8 protocol from the code and write a
guide to client implementation. This is meant to be a more user-friendly
and up-to-date version of the proposal wiki we had on the protocol changes.

Here is what I wrote up so far:
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

I would love feedback on this document. It would be great if we could
document everything you need to know to write a client so people don't need
to reverse engineer our code.

In doing this I found a number of things, some of which I feel should be
fixed in 0.8, some of which can maybe wait.

1. Correlation id is not used across all the requests. I don't think it can
work as intended because of this.
2. On reflection I am not sure that we need a correlation id field. I think
that since we need to guarantee that processing is sequential on any
particular socket we can correlate with a simple queue. (e.g. as the client
sends messages it adds them to a queue and as it receives responses it just
correlates to whatever is at the head of the queue).
3. The metadata response seems to have a number of problems. Among them is
that it weirdly repeats all the broker information many times. The response
includes the ISR, leader (maybe), and the replicas. Each of these repeats
all the broker information. This is super weird. I think what we should be
doing here is including all broker information for all brokers once and then
just having the appropriate ids for the ISR, leader, and replicas.
4. For topic discovery I think we need to support the case where no topics
are specified in the metadata request, and in that case return information
about all topics. I don't think we do this now.
5. I don't understand what the creator id is.
6. The offset request and response are not fully thought through and should
be generalized.
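
As a rough sketch of the queue idea in item 2: if responses on a socket always
come back in the order the requests were sent, a client could pair them up
with a plain FIFO queue. The Python below is illustrative only. The class and
method names are hypothetical, and strings stand in for real request and
response objects.

```python
from collections import deque


class InFlightRequests:
    """Pairs responses with requests on one socket, assuming in-order replies."""

    def __init__(self):
        self._pending = deque()

    def sent(self, request):
        # Remember each request in the order it was written to the socket.
        self._pending.append(request)

    def received(self, response):
        # The oldest outstanding request is the one this response answers.
        request = self._pending.popleft()
        return request, response


in_flight = InFlightRequests()
in_flight.sent("metadata request")
in_flight.sent("fetch request")
first = in_flight.received("metadata response")
second = in_flight.received("fetch response")
```

This only works if the server never reorders responses on a connection, which
is the sequential-processing guarantee mentioned above.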

-Jay