[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180173#comment-14180173 ] Benjamin Lerer commented on CASSANDRA-7927: --- 2 minor nits for me: * As {{JVMStabilityInspector}} as only static methods and instances, I would have make it final with a private constructor. * Personally I prefer to separate test related classes from production classes so I would have put KillerForTests with the tests but it is a matter of taste. Otherwise it is a +1 for me. Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.2 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174120#comment-14174120 ] Joshua McKenzie commented on CASSANDRA-7927: Another updated pushed to branch [here|https://github.com/josh-mckenzie/cassandra/compare/7927]; While working on CASSANDRA-7579 I noticed that the _die unit test was failing on linux (for entirely different reasons than the Windows failure). Digging into it a bit shows that the unit test it was based on, testCommitFailurePolicy_stop(), didn't actually do what it was intended to do. StorageService isn't initialized by SchemaLoader so the assertions to check on _stop test always passed. Also, changing a directory to write-only doesn't change the contents to being write-only so flushes would keep working even if the StorageService had been started. I've opened the interface on CommitLog.handleCommitError as public, marked it VisibleForTesting, and updated those 2 unit tests to check the logic specifically dealing with how our CommitLog system deals with throwables during stop and die policy settings. Tests pass on both Windows and linux now. Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169707#comment-14169707 ] Joshua McKenzie commented on CASSANDRA-7927: The previously linked branch actually had a couple of problems with it I've resolved [here|https://github.com/josh-mckenzie/cassandra/compare/7927?expand=1]. Namely, the when I combined the checking for FSError / CorruptSSTableException in inspectThrowable I didn't check the Commit log failure policy in the DatabaseDescriptor and also wouldn't have been able to do so without augmenting the information passed in to indicate it originated in a CommitLog context. I think you were on the right track w/having an independent entry point for inspection of CommitLog errors - that way we can kill the JVM on *any* commit log errors without having to worry about the type of error thrown on the CommitLog operation. I did a few other things on this branch as well: # added an entry in CHANGES.txt # added assertion to CommitLogTest to confirm the _die actually worked # added a workaround for the fact that File.setWritable(false) on a directory fails on Windows (/sigh) # merged the KillerForTests into the JVMStabilityInspector to help keep the code-base clean # promoted the inspection in FileUtils and in CommitLog of the Throwable to the root of (handleFSError/handleCorruptSSTable/handleCommitError) so the inspector will immediately kill if appropriate Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170210#comment-14170210 ] John Sumsion commented on CASSANDRA-7927: - LGTM Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166813#comment-14166813 ] John Sumsion commented on CASSANDRA-7927: - Looks great! The compromise on checking policy twice looks like it keeps the code less scattered. Thanks for doing the cleanup! Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165600#comment-14165600 ] Joshua McKenzie commented on CASSANDRA-7927: Sorry for the delay on this - I have a version rebased to 2.1 head [available here|https://github.com/josh-mckenzie/cassandra/compare/7927?expand=1] * Added support for die policy to CommitLog exception handling * Removed 'killMeNow' method in StabilityInspector * Migrated the FileUtil killing logic into the StabilityInspector * Slight refactor on JVMStabilityInspector to keep it single-point-of-entry (hand Throwable to it, let it deal with it) * Updated the unit tests to work w/the new structure * Removed erroneous added entries from Config.CommitFailurePolicy * Reverted ordering on enums in Config to just append the new entry on the end Regarding migrating the logic into the JVMStabilityInspector: I expect we're going to have very few exception conditions that will cause us to mark the JVM as unstable and kill it, so I'd prefer to keep that class as simple as possible and nest that logic inside it rather than distributing it throughout by opening the interface to a 'killMeNow' type method. Hand a throwable to it, it'll kill things if they need to be killed. [~jdsumsion]: could you review the revised branch posted above? Thanks! Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132919#comment-14132919 ] John Sumsion commented on CASSANDRA-7927: - NOTE: patch is based on CASSANDRA-7507, and adds tests for that patch also Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Labels: lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132930#comment-14132930 ] Joshua McKenzie commented on CASSANDRA-7927: On initial read through this LGTM. Once Aleksey's gotten to 7507 I'll rebase this, do a bit more thorough testing, and commit. Also: good call adding the unit tests for the Inspector! Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Labels: lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)