[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-22 Thread Benjamin Lerer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180173#comment-14180173
 ] 

Benjamin Lerer commented on CASSANDRA-7927:
---

2 minor nits for me:
* As {{JVMStabilityInspector}} as only static methods and instances, I would 
have make it final with a private constructor.
* Personally I prefer to separate test related classes from production classes 
so I would have put KillerForTests with the tests but it is a matter of taste.

Otherwise it is a +1 for me.

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.2

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-16 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174120#comment-14174120
 ] 

Joshua McKenzie commented on CASSANDRA-7927:


Another updated pushed to branch 
[here|https://github.com/josh-mckenzie/cassandra/compare/7927];

While working on CASSANDRA-7579 I noticed that the _die unit test was failing 
on linux (for entirely different reasons than the Windows failure). Digging 
into it a bit shows that the unit test it was based on, 
testCommitFailurePolicy_stop(), didn't actually do what it was intended to do. 
StorageService isn't initialized by SchemaLoader so the assertions to check on 
_stop test always passed. Also, changing a directory to write-only doesn't 
change the contents to being write-only so flushes would keep working even if 
the StorageService had been started.

I've opened the interface on CommitLog.handleCommitError as public, marked it 
VisibleForTesting, and updated those 2 unit tests to check the logic 
specifically dealing with how our CommitLog system deals with throwables during 
stop and die policy settings.  Tests pass on both Windows and linux now.

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-13 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169707#comment-14169707
 ] 

Joshua McKenzie commented on CASSANDRA-7927:


The previously linked branch actually had a couple of problems with it I've 
resolved 
[here|https://github.com/josh-mckenzie/cassandra/compare/7927?expand=1].  
Namely, the when I combined the checking for FSError / CorruptSSTableException 
in inspectThrowable I didn't check the Commit log failure policy in the 
DatabaseDescriptor and also wouldn't have been able to do so without augmenting 
the information passed in to indicate it originated in a CommitLog context.  I 
think you were on the right track w/having an independent entry point for 
inspection of CommitLog errors - that way we can kill the JVM on *any* commit 
log errors without having to worry about the type of error thrown on the 
CommitLog operation.

I did a few other things on this branch as well:
# added an entry in CHANGES.txt
# added assertion to CommitLogTest to confirm the _die actually worked
# added a workaround for the fact that File.setWritable(false) on a directory 
fails on Windows (/sigh)
# merged the KillerForTests into the JVMStabilityInspector to help keep the 
code-base clean
# promoted the inspection in FileUtils and in CommitLog of the Throwable to the 
root of (handleFSError/handleCorruptSSTable/handleCommitError) so the inspector 
will immediately kill if appropriate

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-13 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170210#comment-14170210
 ] 

John Sumsion commented on CASSANDRA-7927:
-

LGTM

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-10 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166813#comment-14166813
 ] 

John Sumsion commented on CASSANDRA-7927:
-

Looks great!  The compromise on checking policy twice looks like it keeps the 
code less scattered.

Thanks for doing the cleanup!

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-09 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165600#comment-14165600
 ] 

Joshua McKenzie commented on CASSANDRA-7927:


Sorry for the delay on this -  I have a version rebased to 2.1 head [available 
here|https://github.com/josh-mckenzie/cassandra/compare/7927?expand=1]
* Added support for die policy to CommitLog exception handling
* Removed 'killMeNow' method in StabilityInspector
* Migrated the FileUtil killing logic into the StabilityInspector
* Slight refactor on JVMStabilityInspector to keep it single-point-of-entry 
(hand Throwable to it, let it deal with it)
* Updated the unit tests to work w/the new structure
* Removed erroneous added entries from Config.CommitFailurePolicy
* Reverted ordering on enums in Config to just append the new entry on the end

Regarding migrating the logic into the JVMStabilityInspector: I expect we're 
going to have very few exception conditions that will cause us to mark the JVM 
as unstable and kill it, so I'd prefer to keep that class as simple as possible 
and nest that logic inside it rather than distributing it throughout by opening 
the interface to a 'killMeNow' type method.  Hand a throwable to it, it'll kill 
things if they need to be killed.

[~jdsumsion]: could you review the revised branch posted above?  Thanks!

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-09-13 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132919#comment-14132919
 ] 

John Sumsion commented on CASSANDRA-7927:
-

NOTE: patch is based on CASSANDRA-7507, and adds tests for that patch also

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
  Labels: lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-09-13 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132930#comment-14132930
 ] 

Joshua McKenzie commented on CASSANDRA-7927:


On initial read through this LGTM.  Once Aleksey's gotten to 7507 I'll rebase 
this, do a bit more thorough testing, and commit.

Also: good call adding the unit tests for the Inspector!

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
  Labels: lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)