Hi Christo,

Thank you for the KIP. Kafka is very sensitive to filesystem errors, and at the first IO error the whole log directory is permanently considered offline. It seems your proposal aims to increase the robustness of Kafka, and that's a positive improvement.

I have some questions:

11. "Instead of sending a delete topic request only to replicas we know to be online, we will allow a delete topic request to be sent to all replicas regardless of their state. Previously a controller did not send delete topic requests to brokers because it knew they would fail. In the future, topic deletions for saturated topics will succeed, but topic deletions for the offline scenario will continue to fail."

It seems you're describing ZK mode behavior? In KRaft mode the Controller does not send requests to Brokers. Instead, the Controller persists new metadata records, which all online Brokers then fetch. Since it's too late to be proposing design changes for ZK mode, is this change necessary? Is there a difference in how the metadata records should be processed by Brokers?

12. "We will add a new state to the broker state machines of a log directory (saturated) and a partition replica (saturated)."

How are log directories and partition replicas in these states represented in the Admin API, e.g. in `DescribeReplicaLogDirs`?

13. Should there be any metrics indicating the new saturated state for log directories and replicas?

14. "If an IOException due to No space left on device is raised (we will check the remaining space at that point in time rather than the exception message) the broker will stop all operations on logs located in that directory, remove all fetchers and stop compaction. Retention will continue to be respected. The same node as the current state will be written to in Zookeeper. All other IOExceptions will continue to be treated the same way they are treated now and will result in a log directory going offline."

Does a log directory in this "saturated" state transition back to online if more storage space becomes available, e.g. due to retention policy enforcement or topic deletion? Or does the Broker still require a restart to bring the log directory back to full operation?

Best,

--
Igor