[ 
https://issues.apache.org/jira/browse/KUDU-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Martonka reassigned KUDU-3458:
-------------------------------------

    Assignee: Zoltan Martonka

> Continue loading other tablets even if metadata for some tablets failed to 
> load
> -------------------------------------------------------------------------------
>
>                 Key: KUDU-3458
>                 URL: https://issues.apache.org/jira/browse/KUDU-3458
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Alexey Serbin
>            Assignee: Zoltan Martonka
>            Priority: Major
>              Labels: scalability, supportability, troubleshooting
>
> kudu-tserver stops tablet bootstrapping if a single tablet's metadata fails 
> to load (the kudu-tserver process exits on such an event, but with the caveat 
> of KUDU-3419).
> This behavior requires manual intervention.  In most cases, the 
> reason behind the failure to load tablet metadata is a corrupted metadata 
> file.  The suspected cause of such corruption is a power failure, kernel 
> panic, or similar event where an open file wasn't synced.
> In a cluster with many tablet servers, where RF=3, if a majority of 
> tablet replicas are present, such a situation with a corrupted file could be 
> addressed automatically if the tablet server continued bootstrapping the 
> other tablet replicas and eventually registered with the Kudu masters.  The 
> system catalog would detect that the tablet is under-replicated because one 
> replica isn't running, and would re-replicate it elsewhere, sending 
> DELETE_TABLET for the tablet replica that has the corrupted metadata file.  
> That'd be similar to what happens if the consensus metadata for a tablet 
> replica were corrupted.
> It's necessary to update the code in {{TSTabletManager}} and allow 
> {{TSTabletManager::Init()}} to complete successfully in such a case, marking 
> the corresponding tablet replicas as failed to load (similar to what's done 
> in the case of a replica's consensus metadata).
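
The proposed change to the init loop can be illustrated with a minimal,
standalone sketch. It is not Kudu code: ReplicaState, LoadResult, InitSummary,
and LoadAllTabletMetadata are hypothetical names invented for this example,
and the real TSTabletManager::Init() may be structured quite differently. The
only point shown is that a per-tablet metadata load failure is recorded and
the loop continues, rather than aborting initialization.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT Kudu's actual types.
enum class ReplicaState { kLoaded, kFailedToLoad };

struct LoadResult {
  bool ok;
  std::string error;  // empty when ok is true
};

struct InitSummary {
  std::map<std::string, ReplicaState> states;  // tablet id -> outcome
  std::map<std::string, std::string> errors;   // tablet id -> failure reason
};

// Tries to load metadata for every tablet id. A failure on one tablet is
// recorded and the loop moves on to the next tablet instead of returning
// early, so the caller can still finish initialization.
InitSummary LoadAllTabletMetadata(
    const std::vector<std::string>& tablet_ids,
    const std::function<LoadResult(const std::string&)>& load_metadata) {
  assert(static_cast<bool>(load_metadata));  // a loader must be supplied
  InitSummary summary;
  for (const auto& id : tablet_ids) {
    const LoadResult result = load_metadata(id);
    if (result.ok) {
      summary.states[id] = ReplicaState::kLoaded;
    } else {
      summary.states[id] = ReplicaState::kFailedToLoad;
      summary.errors[id] = result.error;
    }
  }
  return summary;
}
```

In this sketch the caller would treat any kFailedToLoad entries as replicas to
report as failed once the server registers, which is the precondition for the
re-replication described in the issue. How the failed state is actually
surfaced to the masters is outside what the sketch covers.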



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
