[ https://issues.apache.org/jira/browse/CASSANDRA-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan King updated CASSANDRA-809:
--------------------------------

    Fix Version/s: 1.0 (was: 0.8)

> Full disk can result in being marked down
> -----------------------------------------
>
>                 Key: CASSANDRA-809
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-809
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan King
>            Priority: Minor
>             Fix For: 1.0
>
>
> We had a node fill up the disk under one of two data directories. The result
> was that the node stopped making progress. The problem appears to be this
> (I'll update with more details as we find them):
>
> When new tasks are put onto most queues in Cassandra, if there isn't a thread
> in the pool to handle the task immediately, the task is run in the caller's
> thread (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets
> the caller-runs policy). The queue in question here is the queue that manages
> flushes, which is fed from various places in our code (and therefore likely
> from multiple threads). Assuming that the full disk meant that no threads
> doing flushing could make progress (it appears that way), eventually any
> thread that calls the flush code would become stalled.
>
> Assuming our analysis is right (and we're still looking into it), we need to
> make a change. Here's a proposal so far:
>
> SHORT TERM:
> * Change the ThreadPoolExecutor policy to not be caller-runs. This will let
>   other threads make progress in the event that one pool is stalled.
>
> LONG TERM:
> * It appears that there are n threads for n data directories that we flush
>   to, but they're not dedicated to a data directory. We should have a thread
>   per data directory and have that thread dedicated to that directory.
> * Perhaps we could use the failure detector on disks?
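
The caller-runs behaviour described in the report can be reproduced with plain java.util.concurrent classes. The following is a minimal sketch, not Cassandra's code: the class name CallerRunsDemo and the method submitToSaturatedPool are hypothetical, and a task blocked on a latch stands in for a flush stuck on a full disk. It saturates a one-thread pool and shows where a second task ends up under CallerRunsPolicy compared with AbortPolicy, which is just one possible non-caller-runs policy and not necessarily the one the project would choose.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; nothing here is taken from the Cassandra code base.
public class CallerRunsDemo {

    /**
     * Saturates a one-thread pool with a blocked task, submits a second task,
     * and reports what happened to it: the name of the thread that ran it,
     * or "rejected" if the policy refused it.
     */
    public static String submitToSaturatedPool(RejectedExecutionHandler policy) {
        // One worker and a zero-capacity queue: a second task cannot be queued.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS, new SynchronousQueue<>(), policy);
        CountDownLatch release = new CountDownLatch(1);
        AtomicReference<String> ranOn = new AtomicReference<>("not run");
        try {
            // Stand-in for a stalled flush: occupies the only worker thread.
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            try {
                // The pool is saturated, so the rejection policy decides.
                pool.execute(() -> ranOn.set(Thread.currentThread().getName()));
            } catch (RejectedExecutionException e) {
                return "rejected";
            }
            return ranOn.get();
        } finally {
            release.countDown(); // unblock the worker so the pool can shut down
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("CallerRunsPolicy: "
                + submitToSaturatedPool(new ThreadPoolExecutor.CallerRunsPolicy()));
        System.out.println("AbortPolicy:      "
                + submitToSaturatedPool(new ThreadPoolExecutor.AbortPolicy()));
    }
}
```

Under CallerRunsPolicy the second task executes on the submitting thread. In this sketch that task returns immediately, but if it were itself a flush against a full disk, the submitting thread would stall with it, which matches the failure mode suggested above. With AbortPolicy the caller instead gets a RejectedExecutionException and has to decide what to do with the work, so switching policies trades a silent stall for explicit error handling.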