[
https://issues.apache.org/jira/browse/KAFKA-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kamal Chandraprakash updated KAFKA-19981:
-----------------------------------------
Description:
RemoteLogManager does not distinguish between RemoteStorageException and
RetriableRemoteStorageException.
Plugin implementors of RemoteStorageManager might implement circuit
breakers and throw RetriableRemoteStorageException when the object
storage degrades. We may have to handle the retriable error gracefully in the
segment copy / deletion path so that it does not pollute the broker logs.
1. When a segment copy fails, RemoteLogManager issues a [deletion
call|https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManager.java?L1048]
to delete the segment whose upload failed. When RemoteStorageManager throws
RetriableRemoteStorageException, this deletion call can be avoided.
2. When RemoteStorageManager throws RetriableRemoteStorageException during
deletion, we can skip marking the
[failedRemoteDeleteRequestRate|https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManager.java?L1517]
metric. This approach is similar to the [copy failure
behaviour|https://sourcegraph.com/github.com/apache/kafka/-/blob/storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManager.java?L1005]
on a retriable error.
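The two points above could be sketched roughly as follows. This is only an illustration of the proposed behaviour, not the actual RemoteLogManager code: the two exception classes are local stand-ins for the Kafka types so that the example compiles on its own, and SegmentStore, RetriableHandlingSketch, the events list, and failedDeleteCount are hypothetical names (failedDeleteCount stands in for the failedRemoteDeleteRequestRate metric).

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins for the Kafka exception types named in this ticket,
// declared here only so that the sketch is self-contained.
class RemoteStorageException extends Exception {
    RemoteStorageException(String message) { super(message); }
}

class RetriableRemoteStorageException extends RemoteStorageException {
    RetriableRemoteStorageException(String message) { super(message); }
}

// Hypothetical, heavily reduced view of a remote storage plugin.
interface SegmentStore {
    void copy(String segmentId) throws RemoteStorageException;
    void delete(String segmentId) throws RemoteStorageException;
}

// Hypothetical handler showing how the two exception types could be told apart.
class RetriableHandlingSketch {
    final List<String> events = new ArrayList<>();
    int failedDeleteCount = 0; // stand-in for failedRemoteDeleteRequestRate
    private final SegmentStore store;

    RetriableHandlingSketch(SegmentStore store) { this.store = store; }

    void copySegment(String segmentId) {
        try {
            store.copy(segmentId);
            events.add("copied " + segmentId);
        } catch (RetriableRemoteStorageException e) {
            // Point 1: storage is degraded, so skip the cleanup deletion
            // and leave the segment to be retried later.
            events.add("copy deferred " + segmentId);
        } catch (RemoteStorageException e) {
            // Non-retriable failure: clean up the failed upload.
            events.add("copy failed " + segmentId);
            deleteSegment(segmentId);
        }
    }

    void deleteSegment(String segmentId) {
        try {
            store.delete(segmentId);
            events.add("deleted " + segmentId);
        } catch (RetriableRemoteStorageException e) {
            // Point 2: do not mark the failure metric for a retriable error.
            events.add("delete deferred " + segmentId);
        } catch (RemoteStorageException e) {
            failedDeleteCount++;
            events.add("delete failed " + segmentId);
        }
    }
}
```

The retriable catch clause comes first because, in this sketch as in Kafka, RetriableRemoteStorageException is a subclass of RemoteStorageException; with the order reversed the code would not compile.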
> Handle retriable remote storage exception in RemoteLogManager
> -------------------------------------------------------------
>
> Key: KAFKA-19981
> URL: https://issues.apache.org/jira/browse/KAFKA-19981
> Project: Kafka
> Issue Type: Task
> Components: Tiered-Storage
> Reporter: Kamal Chandraprakash
> Assignee: Lan Ding
> Priority: Minor