[ https://issues.apache.org/jira/browse/CASSANDRA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-2118:
--------------------------------------
    Attachment: 2118-tweaked.txt

Looks reasonable. Tweaked version attached w/ some minor cleanup.

Other things worth addressing:

- Is there a reason for the FSError.Op enum? Looks like we don't need it if we just use instanceof instead in handleFSError.
- Instead of trying to catch all the places we iterate sstables, what about either (1) removing unreadable sstables in DataTracker.get[Uncompacting]SSTables or (2) ripping them out of DataTracker when we handle the error? Either of those seems more foolproof to me.
- Would be nice to persist the blacklisted sstables somehow. Maybe write a copy to each (other) data directory, so we don't try to read sstables that we've blacklisted, after a restart?
- May be worth adding another option: best_effort_with_repair, where when we detect an unreadable disk we kick off a repair to rebuild that data automatically.

> Provide failure modes if issues with the underlying filesystem of a node
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2118
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2118
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Aleksey Yeschenko
>             Fix For: 1.2
>
>         Attachments: 0001-Provide-failure-modes-if-issues-with-the-underlying-.patch,
>                      0001-Provide-failure-modes-if-issues-with-the-underlying-v2.patch,
>                      0001-Provide-failure-modes-if-issues-with-the-underlying-v3.patch,
>                      2118-tweaked.txt, CASSANDRA-2118-part1.patch, CASSANDRA-2118-v1.patch
>
>
> CASSANDRA-2116 introduces the ability to detect FS errors.
> Let's provide a mode in cassandra.yaml so operators can decide what to do
> in the event of a failure:
> 1) standard - continue on all errors (default)
> 2) read - stop the gossip/rpc server only if reads from the drive fail;
>    writes can fail without killing the gossip/rpc server
> 3) readwrite - stop the gossip/rpc server on any read or write error
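
To make the instanceof suggestion and the three quoted modes concrete, here is a minimal sketch of how the decision in handleFSError could be written without an Op enum. Everything in it is an assumption for illustration: the class name FailureModeSketch, the Mode enum, the FSReadError/FSWriteError subclasses and the shouldStopServices method are hypothetical and are not taken from the attached patches.

```java
import java.io.IOException;

// Hypothetical sketch only; these names are NOT from the attached patches.
class FailureModeSketch {
    // The three modes proposed in the issue description.
    enum Mode { STANDARD, READ, READWRITE }

    // Assumed error hierarchy: one subclass per kind of failure,
    // so the kind can be recovered with instanceof instead of an Op enum.
    static class FSError extends RuntimeException {
        FSError(Throwable cause) { super(cause); }
    }

    static class FSReadError extends FSError {
        FSReadError(Throwable cause) { super(cause); }
    }

    static class FSWriteError extends FSError {
        FSWriteError(Throwable cause) { super(cause); }
    }

    // Returns true when the gossip/rpc server should be stopped for this error.
    static boolean shouldStopServices(Mode mode, FSError error) {
        switch (mode) {
            case READ:
                // Only failed reads are fatal; failed writes are tolerated.
                return error instanceof FSReadError;
            case READWRITE:
                // Any read or write failure is fatal.
                return error instanceof FSReadError || error instanceof FSWriteError;
            default:
                // STANDARD: continue on all errors.
                return false;
        }
    }

    public static void main(String[] args) {
        FSError read = new FSReadError(new IOException("simulated read failure"));
        FSError write = new FSWriteError(new IOException("simulated write failure"));
        for (Mode mode : Mode.values()) {
            System.out.println(mode
                    + ": stop on read error=" + shouldStopServices(mode, read)
                    + ", stop on write error=" + shouldStopServices(mode, write));
        }
    }
}
```

With subclasses carrying the read/write distinction, a further mode such as best_effort_with_repair would only need another enum constant and another case here, assuming the handler also has a way to trigger the repair.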