Mike Percy created KUDU-2792:
--------------------------------

             Summary: Automatically retry failed bootstrap on tablets that failed to start due to disk space
                 Key: KUDU-2792
                 URL: https://issues.apache.org/jira/browse/KUDU-2792
             Project: Kudu
          Issue Type: Task
          Components: tserver
    Affects Versions: 1.8.0
            Reporter: Mike Percy


If a tablet replica fails to bootstrap because there is insufficient disk space 
to replay the WAL, it remains in the FAILED state even after the user frees up 
disk space. In ksck it looks like this:

 
{code:java}
5edf82f0516b4897b3a7991a7e67d71c (host1.example.com:7050): not running [LEADER]
 State: FAILED
 Data state: TABLET_DATA_READY
 Last status: IO error: Failed log replay. Reason: Failed to open new log: Insufficient disk space to allocate 8388608 bytes under path /data/1/kudu/tablet/wal/wals/5807c5100e0d4522a66e32efbb29d57e/.kudutmp.newsegmentzGFKEg (7939936256 bytes available vs 19993874923 bytes reserved) (error 28)
{code}
Today, recovering from this state requires a tablet server restart.
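
For illustration, the numbers in the status message above can be checked by hand. The sketch below assumes the check behind the message is "available minus requested must not drop below reserved"; that is inferred from the wording of the message, not quoted from the Kudu source, and the function names are hypothetical.

```python
import errno

# Values copied from the "Last status" message above.
AVAILABLE_BYTES = 7939936256
RESERVED_BYTES = 19993874923
REQUESTED_BYTES = 8388608  # one new WAL segment preallocation (8 MiB)


def has_sufficient_space(available, requested, reserved):
    """Assumed rule: the allocation must leave at least `reserved` bytes free."""
    return available - requested >= reserved


def shortfall_bytes(available, requested, reserved):
    """Bytes that would have to be freed before the allocation could succeed."""
    return max(0, reserved + requested - available)


def is_out_of_space_code(posix_code):
    """'error 28' in the message corresponds to ENOSPC."""
    return posix_code == errno.ENOSPC
```

Under that assumption the allocation fails by a wide margin, so freeing a small amount of space would not be enough for a retry to succeed.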

It should be possible for a tablet server (i.e. the TSTabletManager) to detect 
that the failure was transient rather than permanent, and to retry the failed 
bootstrap once additional disk space has been freed. From an implementation 
perspective, that may require dealing with some object lifecycle issues (e.g. 
not reusing the Tablet object from the failed bootstrap).
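
A rough sketch of the proposed behavior, in Python rather than Kudu's C++ and only to make the idea concrete. Every name here (BootstrapError, ReplicaManager, retry_failed, and so on) is hypothetical and does not correspond to existing Kudu classes or methods. It assumes a failure counts as transient when its POSIX code is ENOSPC, and it builds a fresh Tablet object for every bootstrap attempt.

```python
import errno


class BootstrapError(Exception):
    """Hypothetical bootstrap failure carrying an optional POSIX error code."""

    def __init__(self, message, posix_code=None):
        super().__init__(message)
        self.posix_code = posix_code


def is_transient(err):
    # Assumption: only out-of-disk-space failures are worth retrying.
    return err.posix_code == errno.ENOSPC


class Tablet:
    """Stand-in for the per-replica object created for one bootstrap attempt."""

    def __init__(self, tablet_id):
        self.tablet_id = tablet_id


class ReplicaManager:
    """Hypothetical manager that remembers which failed replicas may be retried."""

    def __init__(self, bootstrap_fn):
        self._bootstrap_fn = bootstrap_fn  # callable(Tablet); raises BootstrapError
        self.state = {}                    # tablet_id -> "RUNNING" or "FAILED"
        self.retryable = set()             # tablet_ids that failed transiently

    def open_tablet(self, tablet_id):
        # Never reuse the object from a failed attempt: build a new one each time.
        tablet = Tablet(tablet_id)
        try:
            self._bootstrap_fn(tablet)
        except BootstrapError as err:
            self.state[tablet_id] = "FAILED"
            if is_transient(err):
                self.retryable.add(tablet_id)
            else:
                self.retryable.discard(tablet_id)
            return False
        self.state[tablet_id] = "RUNNING"
        self.retryable.discard(tablet_id)
        return True

    def retry_failed(self):
        # Intended to be called periodically, e.g. from a background task.
        for tablet_id in sorted(self.retryable):
            self.open_tablet(tablet_id)
```

A real implementation would also need to bound or back off the retries and tear down whatever state the failed attempt left behind; the sketch leaves both out.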



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
