[ 
https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021122#comment-13021122
 ] 

Peter Schuller commented on CASSANDRA-2405:
-------------------------------------------

A further complication: Since the intent here is to enable people to set up 
alarms to trigger whenever the time-since-last is not within an acceptable 
range, it raises the issue of whether to keep this information persistent in 
system tables or just in-memory. Keeping in mind that:

(1) For large amounts of data, the cost of running another round of AES "just 
in case" after a node restart is significant.
(2) If the alarm were to be triggered on the information not being available, 
that would immediately lead to false positive alarms when nodes are restarted, 
rendering the alarms useless to operations.
(3) If the alarm were to ignore the case where the information is not yet 
available, that is a very dangerous silent failure and effectively means the 
alarm is not functioning properly.

... I get the feeling one wants this information persistent.
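
To make the trade-off in (1)-(3) concrete, here is a minimal sketch of a 
monitoring check that persists the last successful repair time. Everything 
in it is an assumption for illustration: the JSON state file merely stands in 
for a system table, and the names record_repair, check_repair and 
MAX_REPAIR_AGE_SECONDS are hypothetical, not part of Cassandra.

```python
import json
import os
import tempfile

# Hypothetical threshold: alarm if no successful repair within this window.
MAX_REPAIR_AGE_SECONDS = 7 * 24 * 3600


def load_state(path):
    """Return {column_family: last_success_epoch_seconds}, or {} if no state exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def record_repair(path, column_family, finished_at):
    """Persist the completion time of a successful repair so it survives restarts."""
    state = load_state(path)
    state[column_family] = finished_at
    # Write to a temp file and rename, so a crash mid-write cannot corrupt the state.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def check_repair(path, column_family, now, max_age=MAX_REPAIR_AGE_SECONDS):
    """Classify the repair state as 'OK', 'STALE' or 'UNKNOWN'."""
    last = load_state(path).get(column_family)
    if last is None:
        # No record at all: report it explicitly instead of silently passing (point 3).
        return "UNKNOWN"
    if now - last > max_age:
        return "STALE"
    return "OK"
```

Because the timestamp is read back from disk, a restart does not reset it, so 
"UNKNOWN" should only appear before the first successful repair. Under that 
assumption a monitoring script can treat "UNKNOWN" as an alarm condition 
without producing the false positives described in (2).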

I guess this all makes the ticket non-trivial, but I think it is important to 
give operators an "easy" way to ensure sufficient AES frequency.

(I'm actually kind of surprised issues with this do not crop up more often on 
the mailing lists... am I missing something that mitigates the impact here, or 
are people just using sufficiently long grace periods relative to repair 
frequency that they're not hitting these things in practice?)

> should expose 'time since last successful repair' for easier aes monitoring
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2405
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 0.7.5
>
>         Attachments: CASSANDRA-2405.patch
>
>
> The practical matter of actually ensuring that repair runs is 
> somewhat of an undocumented/untreated issue.
> One hopefully low hanging fruit would be to at least expose the time since 
> last successful repair for a particular column family, to make it easier to 
> write a correct script to monitor for lack of repair in a non-buggy fashion.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
