[
https://issues.apache.org/jira/browse/CASSANDRA-20363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931422#comment-17931422
]
Stefan Miklosovic edited comment on CASSANDRA-20363 at 2/28/25 10:31 AM:
-------------------------------------------------------------------------
[~tommy_s] try to take a look at this (1)
(1) https://github.com/apache/cassandra/pull/3933/files
What I did there is extract all the "failure logic" into one class. Your task
would be to write another implementation of DiskErrorObserver and act
accordingly. Everything goes to that instance, so detecting multiple errors
coming from different methods and starting / stopping monitoring threads should
be much easier. I imagine that when an error arrives for the first time and no
monitoring thread is running, you start one; when another error arrives
(possibly in another method) and you see that the monitoring thread is already
running, you do nothing. Then, once the disk comes back, you start the services
from that thread and let the thread exit.
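To make the flow above concrete, here is a rough sketch. The DiskErrorObserver interface shown here is a hypothetical stand-in, since the real signature in the PR may differ. The health check, the stop / start hooks, and the poll interval are likewise assumptions passed in by the caller, not existing Cassandra APIs.

```java
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Predicate;

// Hypothetical stand-in for the interface in the PR; the real signature may differ.
interface DiskErrorObserver {
    void onDiskError(Path affectedPath, Throwable cause);
}

// Sketch only: stops services on the first error, polls until the disk is
// healthy again, restarts services, and then lets the monitor thread exit.
final class RecoveringDiskErrorObserver implements DiskErrorObserver {
    private final AtomicBoolean monitoring = new AtomicBoolean(false);
    private final Predicate<Path> diskHealthy; // assumed health probe
    private final Runnable stopServices;       // assumed hook, e.g. close gossip and transports
    private final Runnable startServices;      // assumed hook, e.g. reopen gossip and transports
    private final long pollMillis;

    RecoveringDiskErrorObserver(Predicate<Path> diskHealthy,
                                Runnable stopServices,
                                Runnable startServices,
                                long pollMillis) {
        this.diskHealthy = diskHealthy;
        this.stopServices = stopServices;
        this.startServices = startServices;
        this.pollMillis = pollMillis;
    }

    @Override
    public void onDiskError(Path affectedPath, Throwable cause) {
        // Only the first error starts a monitor; later errors, possibly
        // reported from other methods, see the flag already set and return.
        if (!monitoring.compareAndSet(false, true))
            return;
        stopServices.run();
        Thread monitor = new Thread(() -> awaitRecovery(affectedPath), "disk-recovery-monitor");
        monitor.setDaemon(true);
        monitor.start();
    }

    boolean isMonitoring() {
        return monitoring.get();
    }

    private void awaitRecovery(Path affectedPath) {
        try {
            while (!diskHealthy.test(affectedPath))
                Thread.sleep(pollMillis);
            startServices.run();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Clearing the flag lets a future error start a fresh monitor.
            monitoring.set(false);
        }
    }
}
```

The compareAndSet on a single flag is what keeps concurrent errors from starting more than one monitoring thread. An error that arrives after the services restart but before the flag is cleared would be ignored in this sketch; a real implementation would need to decide how to handle that window.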
I also think that, to make it easy to tell which disk is affected, we should
propagate the path of the file the error relates to into all exceptions. I am
not sure whether that is already the case.
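As an illustration of why carrying the path helps, a sketch along these lines could map a failed file back to the data directory (and so the disk) it lives on. PathAwareDiskException and AffectedDisk are hypothetical names made up for this example, not existing Cassandra classes, and the list of data directories is assumed to come from configuration.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical exception type that always carries the path the error relates to.
final class PathAwareDiskException extends RuntimeException {
    private final Path path;

    PathAwareDiskException(Path path, Throwable cause) {
        super("I/O failure on " + path, cause);
        this.path = path;
    }

    Path path() {
        return path;
    }
}

// Hypothetical helper: finds which configured data directory contains the failed file.
final class AffectedDisk {
    private AffectedDisk() {}

    static Optional<Path> owningDirectory(PathAwareDiskException error, List<Path> dataDirectories) {
        Path failed = error.path().toAbsolutePath().normalize();
        for (Path dir : dataDirectories) {
            // Path.startsWith compares whole name elements, so /data1 does not match /data10.
            if (failed.startsWith(dir.toAbsolutePath().normalize()))
                return Optional.of(dir);
        }
        return Optional.empty();
    }
}
```

With the path available on every exception, the observer could monitor only the affected directory instead of guessing which disk went away.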
Centralizing all of that in one place also makes the code clearer.
Try to focus just on the implementation; we can always make it
configurable later, once you are done with your proof of concept.
cc [~brandon.williams]
> Introduce a robust way to intercept FSError and commit log errors
> -----------------------------------------------------------------
>
> Key: CASSANDRA-20363
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20363
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Legacy/Core
> Reporter: Tommy Stendahl
> Assignee: Tommy Stendahl
> Priority: Normal
> Fix For: 5.x
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Add a Java property to override the DefaultFSErrorHandler with a custom
> implementation.
> The use case I am looking at is a customer deployment that uses network
> disks, which can go offline sometimes. I would like to use
> "disk_failure_policy: stop" but automatically detect when the disk is online
> again and just open gossip and transports so the node comes back UP without
> triggering a restart of the node.