[ 
https://issues.apache.org/jira/browse/CASSANDRA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2118:
--------------------------------------

    Attachment: 2118-tweaked.txt

Looks reasonable. Tweaked version attached with some minor cleanup.

Other things worth addressing:
- Is there a reason for the FSError.Op enum? It looks like we don't need it if 
handleFSError just uses instanceof instead.
- Instead of trying to catch all the places we iterate sstables, what about 
either (1) removing unreadable sstables in 
DataTracker.get[Uncompacting]SSTables or (2) ripping them out of DataTracker 
when we handle the error?  Either of those seems more foolproof to me.
- It would be nice to persist the blacklisted sstables somehow, so that we 
don't try to read them again after a restart. Maybe write a copy of the 
blacklist to each (other) data directory?
- It may be worth adding another option, best_effort_with_repair: when we 
detect an unreadable disk, we kick off a repair to rebuild that data 
automatically.
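
To make the first two points concrete, here is a rough sketch of what I have 
in mind. It is only an illustration under assumptions: the class names 
(FsErrorSketch, FSReadError, FSWriteError), the in-memory `unreadable` set, 
and the readableSSTables accessor are hypothetical stand-ins, not the classes 
or signatures from the attached patch or from DataTracker. The error type is 
decided with instanceof on the exception subclass, and unreadable sstables are 
filtered out in one accessor instead of at every place we iterate.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for illustration; not the classes from the patch.
class FsErrorSketch {
    static class FSError extends RuntimeException {
        final File path;

        FSError(Throwable cause, File path) {
            super(cause);
            this.path = path;
        }
    }

    static class FSReadError extends FSError {
        FSReadError(Throwable cause, File path) { super(cause, path); }
    }

    static class FSWriteError extends FSError {
        FSWriteError(Throwable cause, File path) { super(cause, path); }
    }

    // In-memory blacklist of sstable files that failed to read (assumption:
    // a real version would track sstable readers, not bare files).
    static final Set<File> unreadable = ConcurrentHashMap.newKeySet();

    // Dispatch on the exception subtype rather than on an Op enum field.
    static String handleFSError(FSError e) {
        if (e instanceof FSReadError) {
            unreadable.add(e.path);
            return "read";
        }
        if (e instanceof FSWriteError) {
            return "write";
        }
        return "unknown";
    }

    // Option (1): filter blacklisted sstables in the one accessor that
    // callers use, so individual iteration sites need no special handling.
    static List<File> readableSSTables(List<File> all) {
        List<File> result = new ArrayList<>();
        for (File f : all) {
            if (!unreadable.contains(f)) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File good = new File("ks-cf-1-Data.db");
        File bad = new File("ks-cf-2-Data.db");
        handleFSError(new FSReadError(new IOException("simulated"), bad));
        // Only the sstable that was not blacklisted should remain.
        System.out.println(readableSSTables(Arrays.asList(good, bad)));
    }
}
```

Option (2) would instead remove the entry from the tracked set inside 
handleFSError itself; either way the filtering lives in one place.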
                
> Provide failure modes if issues with the underlying filesystem of a node
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2118
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2118
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Aleksey Yeschenko
>             Fix For: 1.2
>
>         Attachments: 
> 0001-Provide-failure-modes-if-issues-with-the-underlying-.patch, 
> 0001-Provide-failure-modes-if-issues-with-the-underlying-v2.patch, 
> 0001-Provide-failure-modes-if-issues-with-the-underlying-v3.patch, 
> 2118-tweaked.txt, CASSANDRA-2118-part1.patch, CASSANDRA-2118-v1.patch
>
>
> CASSANDRA-2116 introduces the ability to detect FS errors. Let's provide a 
> mode in cassandra.yaml so operators can decide what to do in the event of 
> failure:
> 1) standard - continue on all errors (default)
> 2) read - stop the gossip/rpc server only if reads from the drive fail; 
> writes can fail without killing the gossip/rpc server
> 3) readwrite - stop the gossip/rpc server on any read or write error.
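
The three modes quoted above could be modeled roughly as follows. This is a 
minimal sketch under assumptions: the enum name FailureMode and the methods 
parse and shouldStopServers are hypothetical, and the issue does not say what 
the cassandra.yaml key is called, so only the value string is parsed here.

```java
import java.util.Locale;

// Hypothetical model of the three modes from the issue description.
enum FailureMode {
    STANDARD,   // continue on all errors (default)
    READ,       // stop gossip/rpc only when a read fails
    READWRITE;  // stop gossip/rpc on any read or write error

    // Parse the configured value, e.g. "read" or "readwrite".
    // Throws IllegalArgumentException for an unrecognized value.
    static FailureMode parse(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Decide whether the gossip/rpc servers should be stopped for an
    // error; readFailed is true for a read error, false for a write error.
    boolean shouldStopServers(boolean readFailed) {
        switch (this) {
            case READ:
                return readFailed;
            case READWRITE:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        FailureMode mode = FailureMode.parse("read");
        System.out.println(mode + " stops on write error: "
                + mode.shouldStopServers(false));
    }
}
```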


        
