Re: [DISCUSS] KIP-928: Making Kafka resilient to log directories becoming full

Igor Soarez Fri, 02 Jun 2023 09:03:12 -0700

Hi Christo,

Thank you for the KIP. Kafka is very sensitive to filesystem errors,
and at the first IO error the whole log directory is permanently
considered offline. It seems your proposal aims to increase the
robustness of Kafka, and that's a positive improvement.


I have some questions:

11. "Instead of sending a delete topic request only to replicas we
know to be online, we will allow a delete topic request to be sent
to all replicas regardless of their state. Previously a controller
did not send delete topic requests to brokers because it knew they
would fail. In the future, topic deletions for saturated topics will
succeed, but topic deletions for the offline scenario will continue
to fail." It seems you're describing ZK mode behavior? In KRaft
mode the Controller does not send requests to Brokers. Instead
the Controller persists new metadata records which all online Brokers
then fetch. Since it's too late to be proposing design changes for
ZK mode, is this change necessary? Is there a difference in how the
metadata records should be processed by Brokers?

12. "We will add a new state to the broker state machines of a log
directory (saturated) and a partition replica (saturated)."
How are log directories and partition replicas in these states
represented in the Admin API, e.g. in `DescribeReplicaLogDirs`?

13. Should there be any metrics indicating the new saturated state for
log directories and replicas?

14. "If an IOException due to No space left on device is raised (we
will check the remaining space at that point in time rather than the
exception message) the broker will stop all operations on logs
located in that directory, remove all fetchers and stop compaction.
Retention will continue to be respected. The same node as the
current state will be written to in Zookeeper. All other
IOExceptions will continue to be treated the same way they are
treated now and will result in a log directory going offline."
Does a log directory in this "saturated" state transition back to
online if more storage space becomes available, e.g. due to
retention policy enforcement or due to topic deletion, or does the
Broker still require a restart to bring the log directory back to
full operation?
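
To make the question concrete, the transition being asked about could be
sketched roughly as below. This is only an illustration under stated
assumptions: the `LogDirState` enum, the `minFreeBytes` threshold and all
method names are hypothetical and come from neither the KIP nor the Kafka
codebase. The only real API used is `java.nio.file.Files.getFileStore` with
`FileStore.getUsableSpace`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the log directory states discussed in question 14.
final class LogDirStateSketch {

    enum LogDirState { ONLINE, SATURATED, OFFLINE }

    // Classify an IO failure from the space remaining at that moment,
    // rather than from the exception message. minFreeBytes is an assumed
    // threshold; a full disk may still report a few usable bytes.
    static LogDirState classify(long usableBytes, long minFreeBytes) {
        return usableBytes < minFreeBytes ? LogDirState.SATURATED : LogDirState.OFFLINE;
    }

    // Called when an IOException is raised while operating on logDir.
    static LogDirState onIoError(Path logDir, long minFreeBytes) {
        try {
            long usable = Files.getFileStore(logDir).getUsableSpace();
            return classify(usable, minFreeBytes);
        } catch (IOException e) {
            // The space check itself failed, so treat it like any other IO error.
            return LogDirState.OFFLINE;
        }
    }

    // The open question: is a transition like this one allowed without a restart?
    static LogDirState onSpaceCheck(LogDirState current, long usableBytes, long minFreeBytes) {
        if (current == LogDirState.SATURATED && usableBytes >= minFreeBytes) {
            return LogDirState.ONLINE;
        }
        return current;
    }
}
```

In this sketch an offline directory never recovers, while a saturated one
returns to online as soon as retention or a topic deletion frees enough
space. Whether the KIP intends that second transition is exactly what is
unclear.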

Best,

--
Igor
