Re: Complete Kafka replication protocol description
Hi Jack, thanks for providing this material. I'll spend some time digging into it. On Mon, Sep 11, 2023 at 6:56 PM Jack Vanlightly wrote: > > Hi all, > > I agree that we should have the protocol description and specification in > the Kafka repository. I have two other specifications to contribute > including the KRaft implementation of Raft and the next gen consumer group > protocol (WIP). I also have a formal prose description of the Kafka > replication protocol for how it works today. That should appear under the > 3.5 directory sometime this week. > > Regarding moving this kind of stuff into the Kafka repo, there are a number > of additional topics such as: > >- Where to place formal prose descriptions and specifications of >protocols in the Kafka repo. I'm thinking about a dedicated folder (i.e. tlaplus) with a main README.md where we briefly explain what's contained. >- Given we may add more specifications and potentially other formal >prose descriptions, how to structure things? Each spec would live in its own sub folder under specs, along with a README.md containing the prose description. I think we should also provide the configuration files, and a bash script to easily run the model checker on them. Wdyt? >- Do we use markdown like I have done here with svg files (created in >Excalidraw with the Excalidraw bits embedded in the svg)., or do we use >LaTex? Some people may wish to read this directly on GitHub and perhaps >others like a PDF. Markdown is more accessible and perhaps more likely to >fulfil the living document approach. I agree that Markdown is more convenient and will increase the chance that these specifications will receive updates. > > Thanks > Jack > > > On Mon, Sep 11, 2023 at 1:45 PM Divij Vaidya > wrote: > > > This is very useful Jack! > > > > 1. We are missing a replication protocol specification in our project. > > Ideally it should be a living document and adding what you wrote to > > the docs/design would be a great start towards that goal. > > 2. I have also been building on top of some existing TLA+ to add > > changes to replication protocol brought by features such as Tiered > > Storage, local retention etc. at > > > > https://github.com/divijvaidya/kafka-specification/blob/master/KafkaReplication.tla > > 3. Apart from verifying correctness, I believe a TLA+ model will also > > help developers quickly iterate through their fundamental assumptions > > while making changes. As an example, we recently had a discussion in > > the community on whether to increase leader epoch with shrink/expand > > ISR or not. There was another discussion on whether we can choose the > > replica with the largest end offset as the new leader to reduce > > truncation. In both these cases, we could have quickly modified the > > existing TLA+ model, run through it and verify that the assumptions > > still hold true. It would be great if we can take such discussions as > > an example and demonstrate how TLA+ could have benefitted the > > community. It would help make the case for adding the TLA+ spec as > > part of the community owned project itself. > > > > Right now, things are a bit busy on my end, but I am looking forward > > to exploring what you shared above in the coming weeks (perhaps a > > first review by end of sept). > > > > Thank you again for starting this conversation. > > > > -- > > Divij Vaidya > > > > On Mon, Sep 11, 2023 at 4:49 AM Haruki Okada wrote: > > > > > > Hi Jack, > > > > > > Thank you for the great work, not only the spec but also for the > > > comprehensive documentation about the replication. > > > Actually I wrote some TLA+ spec to verify unclean leader election > > behavior > > > before so I will double-check my understanding with your complete spec :) > > > > > > > > > Thanks, > > > > > > 2023年9月10日(日) 21:42 David Jacot : > > > > > > > Hi Jack, > > > > > > > > This is great! Thanks for doing it. I will look into it when I have a > > bit > > > > of time, likely after Current. > > > > > > > > Would you be interested in contributing it to the main repository? That > > > > would be a great contribution to the project. Having it there would > > allow > > > > the community to maintain it while changes to the protocol are made. > > That > > > > would also pave the way for having other specs in the future (e.g. new > > > > consumer group protocol). > > > > > > > > Best, > > > > David > > > > > > > > Le dim. 10 sept. 2023 à 12:45, Jack Vanlightly > > a > > > > écrit : > > > > > > > > > Hi all, > > > > > > > > > > As part of my work on formally verifying different parts of Apache > > Kafka > > > > > and working on KIP-966 I have built up a lot of knowledge about how > > the > > > > > replication protocol works. Currently it is mostly documented across > > > > > various KIPs and in the code itself. I have written a complete > > protocol > > > > > description (with KIP-966 changes applied) which is inspired by the > > > > precise > >
Re: Complete Kafka replication protocol description
Hi all, I agree that we should have the protocol description and specification in the Kafka repository. I have two other specifications to contribute including the KRaft implementation of Raft and the next gen consumer group protocol (WIP). I also have a formal prose description of the Kafka replication protocol for how it works today. That should appear under the 3.5 directory sometime this week. Regarding moving this kind of stuff into the Kafka repo, there are a number of additional topics such as: - Where to place formal prose descriptions and specifications of protocols in the Kafka repo. - Given we may add more specifications and potentially other formal prose descriptions, how to structure things? - Do we use markdown like I have done here with svg files (created in Excalidraw with the Excalidraw bits embedded in the svg)., or do we use LaTex? Some people may wish to read this directly on GitHub and perhaps others like a PDF. Markdown is more accessible and perhaps more likely to fulfil the living document approach. Thanks Jack On Mon, Sep 11, 2023 at 1:45 PM Divij Vaidya wrote: > This is very useful Jack! > > 1. We are missing a replication protocol specification in our project. > Ideally it should be a living document and adding what you wrote to > the docs/design would be a great start towards that goal. > 2. I have also been building on top of some existing TLA+ to add > changes to replication protocol brought by features such as Tiered > Storage, local retention etc. at > > https://github.com/divijvaidya/kafka-specification/blob/master/KafkaReplication.tla > 3. Apart from verifying correctness, I believe a TLA+ model will also > help developers quickly iterate through their fundamental assumptions > while making changes. As an example, we recently had a discussion in > the community on whether to increase leader epoch with shrink/expand > ISR or not. There was another discussion on whether we can choose the > replica with the largest end offset as the new leader to reduce > truncation. In both these cases, we could have quickly modified the > existing TLA+ model, run through it and verify that the assumptions > still hold true. It would be great if we can take such discussions as > an example and demonstrate how TLA+ could have benefitted the > community. It would help make the case for adding the TLA+ spec as > part of the community owned project itself. > > Right now, things are a bit busy on my end, but I am looking forward > to exploring what you shared above in the coming weeks (perhaps a > first review by end of sept). > > Thank you again for starting this conversation. > > -- > Divij Vaidya > > On Mon, Sep 11, 2023 at 4:49 AM Haruki Okada wrote: > > > > Hi Jack, > > > > Thank you for the great work, not only the spec but also for the > > comprehensive documentation about the replication. > > Actually I wrote some TLA+ spec to verify unclean leader election > behavior > > before so I will double-check my understanding with your complete spec :) > > > > > > Thanks, > > > > 2023年9月10日(日) 21:42 David Jacot : > > > > > Hi Jack, > > > > > > This is great! Thanks for doing it. I will look into it when I have a > bit > > > of time, likely after Current. > > > > > > Would you be interested in contributing it to the main repository? That > > > would be a great contribution to the project. Having it there would > allow > > > the community to maintain it while changes to the protocol are made. > That > > > would also pave the way for having other specs in the future (e.g. new > > > consumer group protocol). > > > > > > Best, > > > David > > > > > > Le dim. 10 sept. 2023 à 12:45, Jack Vanlightly > a > > > écrit : > > > > > > > Hi all, > > > > > > > > As part of my work on formally verifying different parts of Apache > Kafka > > > > and working on KIP-966 I have built up a lot of knowledge about how > the > > > > replication protocol works. Currently it is mostly documented across > > > > various KIPs and in the code itself. I have written a complete > protocol > > > > description (with KIP-966 changes applied) which is inspired by the > > > precise > > > > but accessible style and language of the Raft paper. The idea is > that it > > > > makes it easier for contributors and anyone else interested in the > > > protocol > > > > to learn how it works, the fundamental properties it has and how > those > > > > properties are supported by the various behaviors and conditions. > > > > > > > > It currently resides next to the TLA+ specification itself in my > > > > kafka-tlaplus repository. I'd be interested to receive feedback from > the > > > > community. > > > > > > > > > > > > > > > > https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/kip-966/description/0_kafka_replication_protocol.md > > > > > > > > Thanks > > > > Jack > > > > > > > > > > > > > -- > > > > Okada Haruki > > ocadar...@gmail.com > >
Re: Complete Kafka replication protocol description
This is very useful Jack! 1. We are missing a replication protocol specification in our project. Ideally it should be a living document and adding what you wrote to the docs/design would be a great start towards that goal. 2. I have also been building on top of some existing TLA+ to add changes to replication protocol brought by features such as Tiered Storage, local retention etc. at https://github.com/divijvaidya/kafka-specification/blob/master/KafkaReplication.tla 3. Apart from verifying correctness, I believe a TLA+ model will also help developers quickly iterate through their fundamental assumptions while making changes. As an example, we recently had a discussion in the community on whether to increase leader epoch with shrink/expand ISR or not. There was another discussion on whether we can choose the replica with the largest end offset as the new leader to reduce truncation. In both these cases, we could have quickly modified the existing TLA+ model, run through it and verify that the assumptions still hold true. It would be great if we can take such discussions as an example and demonstrate how TLA+ could have benefitted the community. It would help make the case for adding the TLA+ spec as part of the community owned project itself. Right now, things are a bit busy on my end, but I am looking forward to exploring what you shared above in the coming weeks (perhaps a first review by end of sept). Thank you again for starting this conversation. -- Divij Vaidya On Mon, Sep 11, 2023 at 4:49 AM Haruki Okada wrote: > > Hi Jack, > > Thank you for the great work, not only the spec but also for the > comprehensive documentation about the replication. > Actually I wrote some TLA+ spec to verify unclean leader election behavior > before so I will double-check my understanding with your complete spec :) > > > Thanks, > > 2023年9月10日(日) 21:42 David Jacot : > > > Hi Jack, > > > > This is great! Thanks for doing it. I will look into it when I have a bit > > of time, likely after Current. > > > > Would you be interested in contributing it to the main repository? That > > would be a great contribution to the project. Having it there would allow > > the community to maintain it while changes to the protocol are made. That > > would also pave the way for having other specs in the future (e.g. new > > consumer group protocol). > > > > Best, > > David > > > > Le dim. 10 sept. 2023 à 12:45, Jack Vanlightly a > > écrit : > > > > > Hi all, > > > > > > As part of my work on formally verifying different parts of Apache Kafka > > > and working on KIP-966 I have built up a lot of knowledge about how the > > > replication protocol works. Currently it is mostly documented across > > > various KIPs and in the code itself. I have written a complete protocol > > > description (with KIP-966 changes applied) which is inspired by the > > precise > > > but accessible style and language of the Raft paper. The idea is that it > > > makes it easier for contributors and anyone else interested in the > > protocol > > > to learn how it works, the fundamental properties it has and how those > > > properties are supported by the various behaviors and conditions. > > > > > > It currently resides next to the TLA+ specification itself in my > > > kafka-tlaplus repository. I'd be interested to receive feedback from the > > > community. > > > > > > > > > > > https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/kip-966/description/0_kafka_replication_protocol.md > > > > > > Thanks > > > Jack > > > > > > > > -- > > Okada Haruki > ocadar...@gmail.com >
Re: Complete Kafka replication protocol description
Hi Jack, Thank you for the great work, not only the spec but also for the comprehensive documentation about the replication. Actually I wrote some TLA+ spec to verify unclean leader election behavior before so I will double-check my understanding with your complete spec :) Thanks, 2023年9月10日(日) 21:42 David Jacot : > Hi Jack, > > This is great! Thanks for doing it. I will look into it when I have a bit > of time, likely after Current. > > Would you be interested in contributing it to the main repository? That > would be a great contribution to the project. Having it there would allow > the community to maintain it while changes to the protocol are made. That > would also pave the way for having other specs in the future (e.g. new > consumer group protocol). > > Best, > David > > Le dim. 10 sept. 2023 à 12:45, Jack Vanlightly a > écrit : > > > Hi all, > > > > As part of my work on formally verifying different parts of Apache Kafka > > and working on KIP-966 I have built up a lot of knowledge about how the > > replication protocol works. Currently it is mostly documented across > > various KIPs and in the code itself. I have written a complete protocol > > description (with KIP-966 changes applied) which is inspired by the > precise > > but accessible style and language of the Raft paper. The idea is that it > > makes it easier for contributors and anyone else interested in the > protocol > > to learn how it works, the fundamental properties it has and how those > > properties are supported by the various behaviors and conditions. > > > > It currently resides next to the TLA+ specification itself in my > > kafka-tlaplus repository. I'd be interested to receive feedback from the > > community. > > > > > > > https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/kip-966/description/0_kafka_replication_protocol.md > > > > Thanks > > Jack > > > -- Okada Haruki ocadar...@gmail.com
Re: Complete Kafka replication protocol description
Hi Jack, This is great! Thanks for doing it. I will look into it when I have a bit of time, likely after Current. Would you be interested in contributing it to the main repository? That would be a great contribution to the project. Having it there would allow the community to maintain it while changes to the protocol are made. That would also pave the way for having other specs in the future (e.g. new consumer group protocol). Best, David Le dim. 10 sept. 2023 à 12:45, Jack Vanlightly a écrit : > Hi all, > > As part of my work on formally verifying different parts of Apache Kafka > and working on KIP-966 I have built up a lot of knowledge about how the > replication protocol works. Currently it is mostly documented across > various KIPs and in the code itself. I have written a complete protocol > description (with KIP-966 changes applied) which is inspired by the precise > but accessible style and language of the Raft paper. The idea is that it > makes it easier for contributors and anyone else interested in the protocol > to learn how it works, the fundamental properties it has and how those > properties are supported by the various behaviors and conditions. > > It currently resides next to the TLA+ specification itself in my > kafka-tlaplus repository. I'd be interested to receive feedback from the > community. > > > https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/kip-966/description/0_kafka_replication_protocol.md > > Thanks > Jack >
Complete Kafka replication protocol description
Hi all, As part of my work on formally verifying different parts of Apache Kafka and working on KIP-966 I have built up a lot of knowledge about how the replication protocol works. Currently it is mostly documented across various KIPs and in the code itself. I have written a complete protocol description (with KIP-966 changes applied) which is inspired by the precise but accessible style and language of the Raft paper. The idea is that it makes it easier for contributors and anyone else interested in the protocol to learn how it works, the fundamental properties it has and how those properties are supported by the various behaviors and conditions. It currently resides next to the TLA+ specification itself in my kafka-tlaplus repository. I'd be interested to receive feedback from the community. https://github.com/Vanlightly/kafka-tlaplus/blob/main/kafka_data_replication/kraft/kip-966/description/0_kafka_replication_protocol.md Thanks Jack