Re: [DISCUSS] KIP-928: Making Kafka resilient to log directories becoming full

Christo Lolov Mon, 05 Jun 2023 02:58:27 -0700

Heya Igor,

Thank you for reading through the KIP and providing feedback!

11. Good question. I will check whether a change is needed in the
processing of the metadata records and come back. My hunch says no as long
as the Kafka broker is still alive to process the metadata records. This
being said, deleting topics is one of the two things I want to achieve. The
other one is to allow retention to be changed and continue to take effect.
As an example, if a person does not want to lose all data, but has realised
that they are storing 7 days of data while they only need the last 1 day,
they should be able to make the retention more aggressive and recover space
without deleting the topic. In my opinion, the change to the controller for
ZK mode isn't big - where previously requests were sent only to online
replicas, they are now sent to all replicas. I would prefer for it to
make it in, but if reviewers don't find it necessary I am happy to target
just KRaft.
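
To make the retention example concrete, here is a rough back-of-the-envelope
sketch. It is not part of the KIP. The helper names are hypothetical, and it
assumes a uniform ingest rate, which real topics rarely have. Since retention
removes whole log segments, the actual space recovered will only approximate
this figure.

```python
# Illustrative arithmetic only; helper names are hypothetical, not Kafka APIs.
MS_PER_DAY = 24 * 60 * 60 * 1000


def retention_ms(days):
    """Value to use for the topic-level `retention.ms` config for `days` days."""
    return days * MS_PER_DAY


def reclaimable_fraction(current_days, needed_days):
    """Fraction of retained data that becomes eligible for deletion when
    retention is tightened, ASSUMING a uniform ingest rate."""
    if needed_days >= current_days:
        return 0.0
    return 1 - needed_days / current_days


if __name__ == "__main__":
    print("current retention.ms:", retention_ms(7))
    print("target retention.ms: ", retention_ms(1))
    print("approx. fraction reclaimable:", round(reclaimable_fraction(7, 1), 3))
```

Under that assumption, going from 7 days to 1 day makes roughly six sevenths
of the retained data eligible for deletion without deleting the topic.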

12. Great question! Since the KIP aims to be as non-invasive as possible,
the controller has no knowledge of the saturated state - the brokers do not
propagate any new information. As such they will be reported as having
thrown a KafkaStorageException whenever DescribeReplicaLogDirs is called.
Again, this decision came from wanting the change to be as non-invasive
as possible - the new state could be propagated.

13. Yes, I forgot to add this to the KIP and will amend it in the coming
days. I was planning on proposing a metric similar to
kafka.log:type=LogManager,name=OfflineLogDirectoryCount, except that it
will show the count of SaturatedLogDirectory.

14. Great question and I will clarify this in the KIP! No, similarly to
getting out of the offline state, getting out of the saturated state once
space has been reclaimed would require a bounce of the broker. Should the
KIP be accepted, I would like to build upon the proposal to allow
auto-recovery without the need for a restart.
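
As a rough illustration of the transitions discussed in this thread, here is
a minimal sketch. It is not the KIP's implementation. The names
(LogDirState, on_io_error, on_space_reclaimed, on_broker_restart) and the
min_free_bytes threshold are assumptions made for the example. It only
captures three points: the decision is based on remaining space rather than
the exception message, other IO errors still take the directory offline, and
leaving the saturated state needs a restart.

```python
from enum import Enum


class LogDirState(Enum):
    ONLINE = "online"
    SATURATED = "saturated"  # new state proposed in the KIP
    OFFLINE = "offline"


def on_io_error(state, free_bytes, min_free_bytes=0):
    """State of a log directory after an IOException.

    The check uses the remaining space at the time of the error, not the
    exception message. `min_free_bytes` is a hypothetical threshold.
    """
    if state is LogDirState.OFFLINE:
        return state
    if free_bytes <= min_free_bytes:
        return LogDirState.SATURATED
    # Any other IO error is treated as today: the directory goes offline.
    return LogDirState.OFFLINE


def on_space_reclaimed(state):
    """Retention or topic deletion freed space: no automatic recovery."""
    return state


def on_broker_restart(state, free_bytes, min_free_bytes=0):
    """A saturated directory returns to online only after a broker bounce,
    and (by assumption here) only if space is actually available."""
    if state is LogDirState.SATURATED and free_bytes > min_free_bytes:
        return LogDirState.ONLINE
    return state


if __name__ == "__main__":
    s = on_io_error(LogDirState.ONLINE, free_bytes=0)
    print(s)                                  # directory is saturated
    s = on_space_reclaimed(s)
    print(s)                                  # still saturated
    print(on_broker_restart(s, free_bytes=10**9))
```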

Best,
Christo

On Fri, 2 Jun 2023 at 17:02, Igor Soarez <[email protected]> wrote:

> Hi Christo,
>
> Thank you for the KIP. Kafka is very sensitive to filesystem errors,
> and at the first IO error the whole log directory is permanently
> considered offline. It seems your proposal aims to increase the
> robustness of Kafka, and that's a positive improvement.
>
> I have some questions:
>
> 11. "Instead of sending a delete topic request only to replicas we
> know to be online, we will allow a delete topic request to be sent
> to all replicas regardless of their state. Previously a controller
> did not send delete topic requests to brokers because it knew they
> would fail. In the future, topic deletions for saturated topics will
> succeed, but topic deletions for the offline scenario will continue
> to fail." It seems you're describing ZK mode behavior? In KRaft
> mode the Controller does not send requests to Brokers. Instead
> the Controller persists new metadata records which all online Brokers
> then fetch. Since it's too late to be proposing design changes for
> ZK mode, is this change necessary? Is there a difference in how the
> metadata records should be processed by Brokers?
>
> 12. "We will add a new state to the broker state machines of a log
> directory (saturated) and a partition replica (saturated)."
> How are log directories and partitions replicas in these states
> represented in the Admin API? e.g. `DescribeReplicaLogDirs`
>
> 13. Should there be any metrics indicating the new saturated state for
> log directories and replicas?
>
> 14. "If an IOException due to No space left on device is raised (we
> will check the remaining space at that point in time rather than the
> exception message) the broker will stop all operations on logs
> located in that directory, remove all fetchers and stop compaction.
> Retention will continue to be respected. The same node as the
> current state will be written to in Zookeeper. All other
> IOExceptions will continue to be treated the same way they are
> treated now and will result in a log directory going offline."
> Does a log directory in this "saturated" state transition back to
> online if more storage space becomes available, e.g. due to
> retention policy enforcement or due to topic deletion, or does the
> Broker still require a restart to bring the log directory back to
> full operation?
>
> Best,
>
> --
> Igor
>
>
>
