Mike Percy created KUDU-2792:
--------------------------------

             Summary: Automatically retry failed bootstrap on tablets that failed to start due to disk space
                 Key: KUDU-2792
                 URL: https://issues.apache.org/jira/browse/KUDU-2792
             Project: Kudu
          Issue Type: Task
          Components: tserver
    Affects Versions: 1.8.0
            Reporter: Mike Percy

If a tablet replica fails to bootstrap because there is not enough disk space to replay the WAL, it stays in a state that looks like this in ksck, even after the user frees up disk space:

{code:java}
5edf82f0516b4897b3a7991a7e67d71c (host1.example.com:7050): not running [LEADER]
  State: FAILED
  Data state: TABLET_DATA_READY
  Last status: IO error: Failed log replay. Reason: Failed to open new log: Insufficient disk space to allocate 8388608 bytes under path /data/1/kudu/tablet/wal/wals/5807c5100e0d4522a66e32efbb29d57e/.kudutmp.newsegmentzGFKEg (7939936256 bytes available vs 19993874923 bytes reserved) (error 28)
{code}

Today, recovering from this state requires a tablet server restart.

It should be possible for a tablet server (i.e. the TsTabletManager) to detect that the failure was temporary rather than permanent, and to retry the failed bootstrap later, once additional disk space has been freed. From a programming perspective, that may require dealing with some object lifecycle issues (for example, not reusing the Tablet object from the failed bootstrap).
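
A rough sketch of the proposed behavior follows, to make the idea concrete. It is not Kudu code: BootstrapResult, TabletAttempt, IsTransientFailure and BootstrapWithRetry are hypothetical names invented for illustration, and the only assumption carried over from the report above is that the failure surfaces as errno 28 (ENOSPC on Linux). The sketch classifies an out-of-space failure as temporary, retries only in that case, and builds a fresh per-attempt object each time so that nothing from the failed bootstrap is reused.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration; none of these are real Kudu types.
struct BootstrapResult {
  bool ok = false;
  int posix_errno = 0;   // e.g. ENOSPC when the WAL segment cannot be allocated
  std::string message;
};

// Placeholder for the state built during one bootstrap attempt
// (the role played by the Tablet object in the issue description).
struct TabletAttempt {
  int attempt_number = 0;
};

// Assumption: only "no space left on device" counts as a temporary failure.
inline bool IsTransientFailure(const BootstrapResult& r) {
  return !r.ok && r.posix_errno == ENOSPC;
}

using BootstrapFn = std::function<BootstrapResult(TabletAttempt&)>;
using WaitFn = std::function<void(int attempt)>;

// Runs `bootstrap` up to `max_attempts` times. A fresh TabletAttempt is
// created for every try, so no state from a failed attempt is reused.
// `wait_before_retry` is called between attempts (e.g. to sleep or to poll
// for free disk space). Permanent failures are returned immediately.
inline BootstrapResult BootstrapWithRetry(const BootstrapFn& bootstrap,
                                          const WaitFn& wait_before_retry,
                                          int max_attempts) {
  assert(max_attempts > 0);
  BootstrapResult result;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    auto tablet = std::make_unique<TabletAttempt>();
    tablet->attempt_number = attempt;
    result = bootstrap(*tablet);
    if (result.ok || !IsTransientFailure(result)) {
      return result;  // success, or a failure that retrying will not fix
    }
    if (attempt < max_attempts) {
      wait_before_retry(attempt);
    }
  }
  return result;  // still out of space after the last attempt
}
```

In a real tablet server the retry would presumably be scheduled in the background rather than run as a blocking loop, and the wait step could check available space against the reserved-bytes threshold before trying again. Those details are left open here.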