[ 
https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954136#comment-16954136
 ] 

Andrew Wong commented on KUDU-2975:
-----------------------------------

I agree, this would be great to have.

Most of my exploration revolved around duplicating tablet/consensus metadata 
across multiple disks in order to prevent split-brain scenarios in Raft. The 
main issue in spreading metadata across disks is that, for correctness, we 
should never "forget" about the existence of a tablet replica; otherwise, we 
can run into tricky consensus issues. It's been a while since I've thought 
through whether that is still a concern if we keep the metadata but lose the 
WAL (e.g. in the case of a disk failure).

As a thought experiment, let's assume we keep the metadata in a single, 
separate directory, but decide to spread the WALs across directories. If a 
single WAL directory fails or is removed, how should we handle this?
* While the server is still running, we can treat this like a "data directory" 
disk failure: fail the tablet replicas that had anything on the failed disk, 
and let the Master handle the rest, with respect to re-replication and eviction.
* When the server is restarted, if the failed WAL disk has been _removed_, we 
will still have metadata for many tablets on the server, but we will not have a 
WAL for those tablets. We will need to handle this case, probably by _marking 
those tablets as failed_.
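
To make the restart case concrete, here is a minimal Python sketch. It is not 
Kudu code: `ReplicaMeta`, `classify_replicas_at_startup`, the state strings, 
and the directory paths are all hypothetical, and it assumes each replica's 
metadata records which WAL directory the replica was assigned to.

```python
from dataclasses import dataclass


@dataclass
class ReplicaMeta:
    """Hypothetical per-replica metadata kept in the metadata directory."""
    tablet_id: str
    wal_dir: str  # WAL directory this replica was assigned to (assumption)
    state: str = "RUNNING"


def classify_replicas_at_startup(replicas, available_wal_dirs):
    """Mark replicas FAILED when their WAL directory is missing.

    The metadata itself is never dropped, so the server does not "forget"
    the replica; it only reports the replica as failed.
    """
    available = set(available_wal_dirs)
    for replica in replicas:
        if replica.wal_dir not in available:
            replica.state = "FAILED"
    return replicas


replicas = [
    ReplicaMeta("tablet-a", "/data/1/wal"),
    ReplicaMeta("tablet-b", "/data/2/wal"),
    ReplicaMeta("tablet-c", "/data/1/wal"),
]
# Assume the disk backing /data/2 was removed before the restart.
result = classify_replicas_at_startup(replicas, ["/data/1/wal"])
states = {r.tablet_id: r.state for r in result}
print(states)
```

Whether a replica in this state can safely take part in Raft again is exactly 
the open question; the sketch only covers the bookkeeping.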

It's worth thinking about this from a Raft perspective. I haven't proven to 
myself that it's safe to start up with metadata but no WALs, but it might be, 
in which case the failure of a single WAL disk would only affect ~1/12 of the 
tablets hosted on the tserver.

That said, an alternative approach would be to simply crash the server (as we 
do today) if we fail to write to any of the WALs. That would be safe, though it 
would lead to re-replication of the entire tserver.
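
As a rough illustration of the difference between the two approaches, the 
following Python sketch counts how many replicas would need re-replication 
after one WAL directory fails. It assumes replicas are assigned to WAL 
directories round-robin; the function name and the figure of 1200 replicas 
are made up for the example, and 12 directories matches the 12 SSDs in the 
report.

```python
def replicas_to_rereplicate(num_replicas, num_wal_dirs, crash_on_failure):
    """Count replicas needing re-replication after one WAL dir fails.

    Assumes round-robin assignment of replicas to WAL directories and
    that directory 0 is the one that fails.
    """
    if crash_on_failure:
        # The whole tserver goes down, so every replica is affected.
        return num_replicas
    # Only replicas whose WAL lived on the failed directory are affected.
    return sum(1 for i in range(num_replicas) if i % num_wal_dirs == 0)


total = 1200  # hypothetical number of replicas on the tserver
per_dir = replicas_to_rereplicate(total, 12, crash_on_failure=False)
whole = replicas_to_rereplicate(total, 12, crash_on_failure=True)
print(per_dir, whole)
```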

> Spread WAL across multiple data directories
> -------------------------------------------
>
>                 Key: KUDU-2975
>                 URL: https://issues.apache.org/jira/browse/KUDU-2975
>             Project: Kudu
>          Issue Type: New Feature
>          Components: fs, tablet, tserver
>            Reporter: LiFu He
>            Priority: Major
>         Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new Kudu cluster in which every node has 12 SSDs. 
> Then, we created a big table and loaded data into it through Flink. We 
> noticed that the utilization of the one SSD used to store the WAL is 100% 
> while the others are idle. So, we suggest spreading the WAL across multiple 
> data directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
