[ https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761694#comment-16761694 ]
Duo Zhang commented on HBASE-21843:
-----------------------------------

{quote}
I meant if changing AbstractFSWALProvider.isMetaFile() "if" statement to "p.indexOf(META_WAL_PROVIDER_ID) >=0" instead of "p.endsWith(META_WAL_PROVIDER_ID)" would be less risky
{quote}

I know, and my point here is:
1. The code in RegionGroupingProvider is confusing: the way we generate providerId does not consider meta, but later, in other methods, we do consider meta. We should make them consistent.
2. Changing AbstractFSWALProvider.isMetaFile may have other side effects. For example, if you just use indexOf, what if your hostname contains 'meta'? This may introduce other data loss problems...

In general, I do not think multi WAL needs to support the meta region, as there is no advantage to enabling multi WAL for meta, right? We have only one meta region...

But since this is a data loss issue, let's first get it done quickly. We can open other issues to fix the remaining problems in the future.

> RegionGroupingProvider breaks the meta wal file name pattern which may cause
> data loss for meta region
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21843
>                 URL: https://issues.apache.org/jira/browse/HBASE-21843
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 3.0.0, 2.1.0, 2.2.0
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Blocker
>              Labels: data-loss
>             Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>         Attachments: HBASE-21843.master.001.patch, HBASE-21843.patch
>
>
> A bit unusual, but managed to face this twice lately on both distributed and
> local standalone mode, on VMs. Somehow, after some VM pause/resume, got into
> a situation where regions on meta were assigned to a given RS startcode that
> had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no
> SCP is created for this RS startcode, leaving the region "open" on a dead
> server forever, in META.
> Could get this sorted by adding an extra check on loadMeta: checking if the RS
> assigned to the region in meta is not online and doesn't have a WAL dir, then
> marking this region as offline.
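The check proposed at the end of the issue description boils down to a predicate over two sets of servers. The sketch below is a hypothetical illustration, not the HBase patch: the class name MetaLoadCheck, the method shouldMarkOffline, and the use of plain "host,port,startcode" strings instead of HBase's ServerName are all assumptions made for readability.

```java
import java.util.Set;

// Hypothetical sketch of the loadMeta check described in the issue.
// Not HBase code: names and types (plain strings for servers) are assumptions.
public class MetaLoadCheck {

    // A region recorded in meta as open on `server` is treated as orphaned when
    // that server is neither online nor has a WAL directory: per the description,
    // no SCP would ever be created for such a startcode.
    static boolean shouldMarkOffline(String server,
                                     Set<String> onlineServers,
                                     Set<String> serversWithWalDir) {
        return !onlineServers.contains(server) && !serversWithWalDir.contains(server);
    }

    public static void main(String[] args) {
        Set<String> online = Set.of("rs1,16020,1549000000001");
        Set<String> withWalDir = Set.of("rs1,16020,1549000000001", "rs2,16020,1549000000002");

        // Online server: leave the region alone.
        System.out.println(shouldMarkOffline("rs1,16020,1549000000001", online, withWalDir));
        // Dead but still has a WAL dir: assumed to be left to crash recovery.
        System.out.println(shouldMarkOffline("rs2,16020,1549000000002", online, withWalDir));
        // Unknown startcode with no WAL dir: mark the region offline.
        System.out.println(shouldMarkOffline("rs3,16020,1549000000003", online, withWalDir));
    }
}
```

The predicate deliberately requires both conditions, so a dead server that still has WAL data is not short-circuited past whatever recovery would normally replay that data.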
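To make point 2 of the comment concrete, the sketch below compares the current suffix check with the proposed indexOf check on a few file names. It is self-contained and not the HBase source: it assumes META_WAL_PROVIDER_ID is the string ".meta", the WAL file names are illustrative, the third name is a hypothetical meta WAL that lost its trailing ".meta", and the host "rs2.metastore.example.com" is invented to contain ".meta".

```java
// Self-contained comparison of the two isMetaFile variants discussed above.
// Assumptions: META_WAL_PROVIDER_ID is ".meta"; all file names are illustrative.
public class MetaFileCheck {

    static final String META_WAL_PROVIDER_ID = ".meta";

    // Current behaviour: a file is a meta WAL only if its name ends with ".meta".
    static boolean isMetaFileBySuffix(String p) {
        return p != null && p.endsWith(META_WAL_PROVIDER_ID);
    }

    // Proposed alternative: ".meta" anywhere in the name counts.
    static boolean isMetaFileByIndexOf(String p) {
        return p != null && p.indexOf(META_WAL_PROVIDER_ID) >= 0;
    }

    public static void main(String[] args) {
        String[] names = {
            // ordinary WAL
            "rs1.example.com%2C16020%2C1549000000001.1549000000123",
            // meta WAL in the usual suffixed form
            "rs1.example.com%2C16020%2C1549000000001.meta.1549000000123.meta",
            // hypothetical meta WAL missing the trailing ".meta"
            "rs1.example.com%2C16020%2C1549000000001.meta.1549000000123",
            // ordinary WAL on a host whose name happens to contain ".meta"
            "rs2.metastore.example.com%2C16020%2C1549000000002.1549000000456",
        };
        for (String n : names) {
            System.out.printf("suffix=%-5b indexOf=%-5b %s%n",
                isMetaFileBySuffix(n), isMetaFileByIndexOf(n), n);
        }
    }
}
```

The suffix check misses the third name, where the file name pattern is broken as the issue title describes, while indexOf catches it but also flags the ordinary WAL from rs2.metastore.example.com as meta, which is the kind of side effect the comment warns could cause further data loss.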