[jira] [Comment Edited] (CASSANDRA-20363) Introduce a robust way to intercept FSError and commit log errors

Stefan Miklosovic (Jira) Fri, 28 Feb 2025 02:46:50 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-20363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931422#comment-17931422
 ]


Stefan Miklosovic edited comment on CASSANDRA-20363 at 2/28/25 10:36 AM:
-------------------------------------------------------------------------

[~tommy_s] please take a look at this (1)

(1) https://github.com/apache/cassandra/pull/3933/files

What I did there is extract all the "failure logic" into one class. Your 
task would be to code up another implementation of DiskErrorObserver and act 
accordingly. Every error will go to that instance, so detecting multiple 
errors coming from different methods and starting / stopping monitoring 
threads should be much easier.

There should be one monitoring thread per disk. I imagine that when an error 
arrives for the first time and no monitoring thread is running for that 
particular disk, you start one. When another error arrives (possibly via 
another method) and you see that the monitoring thread is already running, 
you do nothing. Once the disk comes back up, you start the services from 
that thread and then exit the thread.
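
A minimal sketch of that "one monitoring thread per disk" bookkeeping, for 
illustration only. Everything here is hypothetical and not taken from the 
PR: the class name DiskRecoveryMonitors, the diskHealthy probe, the 
onRecovered hook and the polling interval are all assumptions. A real 
DiskErrorObserver implementation would call something like onDiskError from 
each of its callbacks.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical helper: keeps at most one recovery monitor thread per disk. */
final class DiskRecoveryMonitors
{
    private final Map<Path, Thread> monitors = new ConcurrentHashMap<>();
    private final Predicate<Path> diskHealthy; // assumed probe, e.g. a test write
    private final Consumer<Path> onRecovered;  // assumed hook, e.g. start gossip and transports
    private final long pollMillis;

    DiskRecoveryMonitors(Predicate<Path> diskHealthy, Consumer<Path> onRecovered, long pollMillis)
    {
        this.diskHealthy = diskHealthy;
        this.onRecovered = onRecovered;
        this.pollMillis = pollMillis;
    }

    /** Call on every error for a disk; returns true only if this call started a monitor. */
    boolean onDiskError(Path disk)
    {
        Thread monitor = new Thread(() -> monitor(disk), "disk-monitor-" + disk);
        monitor.setDaemon(true);
        // putIfAbsent is atomic, so concurrent errors for the same disk start one thread only.
        if (monitors.putIfAbsent(disk, monitor) != null)
            return false;
        monitor.start();
        return true;
    }

    private void monitor(Path disk)
    {
        try
        {
            // Poll until the disk is usable again, then start the services and exit.
            while (!diskHealthy.test(disk))
                Thread.sleep(pollMillis);
            onRecovered.accept(disk);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
        finally
        {
            // Deregister so that a later failure of the same disk starts a fresh monitor.
            monitors.remove(disk, Thread.currentThread());
        }
    }
}
```

Note that in this sketch an error arriving between the recovery callback and 
the deregistration is ignored; whether that window matters depends on how 
the services are restarted.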

I also think that, to easily tell which disk is affected, we should 
propagate the path of the file the error relates to into all exceptions. I 
am not sure whether that is already the case.

edit: I see that FSErrorHandler deals with exceptions that each give access 
to the affected file.
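
Given the affected file, one way to decide which disk it belongs to is to 
match its path against the configured data directories. The sketch below is 
an assumption about how that could look, not what the PR does: the class and 
method names are hypothetical, and it treats "disk" as "the deepest 
configured directory that contains the file". It does pure path comparison, 
so it does not touch a file system that may currently be offline.

```java
import java.nio.file.Path;
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

/** Hypothetical helper: maps an affected file to the data directory that contains it. */
final class AffectedDisk
{
    static Optional<Path> dataDirectoryOf(Path file, Collection<Path> dataDirectories)
    {
        Path normalized = file.toAbsolutePath().normalize();
        return dataDirectories.stream()
                              .map(d -> d.toAbsolutePath().normalize())
                              // Path.startsWith compares whole name elements, not string prefixes.
                              .filter(normalized::startsWith)
                              // If directories are nested, prefer the deepest match.
                              .max(Comparator.comparingInt(Path::getNameCount));
    }
}
```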

Centralizing all of that in one place also makes the code clearer.

Try to focus just on the implementation for now; we can always make it 
configurable later, once you are done with your proof of concept.

cc [~brandon.williams]


> Introduce a robust way to intercept FSError and commit log errors
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-20363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20363
>             Project: Apache Cassandra
>          Issue Type: New Feature
>          Components: Legacy/Core
>            Reporter: Tommy Stendahl
>            Assignee: Tommy Stendahl
>            Priority: Normal
>             Fix For: 5.x
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Add a Java property to override the DefaultFSErrorHandler with a custom 
> implementation.
> The use case I am looking at is a customer deployment that uses network 
> disks, which can sometimes go offline. I would like to use 
> "disk_failure_policy: stop" but automatically detect when the disk is online 
> again and just open gossip and transports, so the node comes back UP without 
> triggering a restart of the node.
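
For illustration, overriding a default implementation through a Java system 
property could be sketched roughly as below. This is a generic sketch under 
stated assumptions, not the mechanism used by Cassandra or by the PR: the 
helper name, the property name passed in by the caller, and the requirement 
of a public no-argument constructor are all hypothetical.

```java
import java.util.function.Supplier;

/** Hypothetical helper: instantiate the class named by a system property, else use a default. */
final class PluggableInstance
{
    static <T> T loadOrDefault(String property, Class<T> type, Supplier<T> fallback)
    {
        String className = System.getProperty(property);
        if (className == null || className.isEmpty())
            return fallback.get();
        try
        {
            // Assumes the custom class has a public no-argument constructor.
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            return type.cast(instance);
        }
        catch (ReflectiveOperationException e)
        {
            throw new IllegalStateException("Cannot instantiate " + className + " from -D" + property, e);
        }
    }
}
```

With such a helper, starting the JVM with the property set to a 
fully qualified class name would select the custom implementation, and 
leaving it unset would keep the default.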



