[ 
https://issues.apache.org/jira/browse/HBASE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach York updated HBASE-20723:
------------------------------
    Description: 
Description:

When a custom hbase.wal.dir is configured, the recovery system uses it in place 
of the HBase root dir and thus constructs an incorrect path for recovered edits 
when splitting WALs. As a result, the recovery code in Region Servers believes 
there are no recovered edits to replay, so any writes that had not been flushed 
before the server was lost are lost as well.

 

Reproduction:

This is an Azure HDInsight HBase cluster with HDP 2.6 and HBase 
1.1.2.2.6.3.2-14.

By default the underlying data goes to wasb://xxxxx@yyyyy/hbase.
I tried to move the WAL folders to HDFS, which is on the SSD mounted on each 
VM at /mnt.

hbase.wal.dir= hdfs://mycluster/walontest

hbase.wal.dir.perms=700

hbase.rootdir.perms=700

hbase.rootdir=wasb://XYZ@hbaseperf.core.net/hbase

Procedure to reproduce this issue:

1. create a table in hbase shell

2. insert a row in hbase shell

3. reboot the VM which hosts that region

4. scan the table in hbase shell and it is empty

Looking at the region server logs:
{code:java}
2018-06-12 22:08:40,455 INFO  [RS_LOG_REPLAY_OPS-wn2-duohba:16020-0-Writer-1] 
wal.WALSplitter: This region's directory doesn't exist: 
hdfs://mycluster/walontest/data/default/tb1/b7fd7db5694eb71190955292b3ff7648. 
It is very likely that it was already split so it's safe to discard those edits.

{code}
The log split/replay ignored the actual WAL because WALSplitter looks for the 
region directory under the hbase.wal.dir we specified rather than under 
hbase.rootdir.

Looking at the source code in
[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitter.java],
WALSplitter uses rootDir, which here is walDir, as the root path for tableDir.

So if we use HBASE-17437 and the WAL dir and the HBase root dir are on 
different paths, or even on different filesystems, then #5 using walDir as the 
tableDir root is clearly wrong.
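
The path mismatch described above can be sketched in a few lines of plain Java. 
This is only an illustration, not HBase code: the class and the regionDir helper 
are hypothetical, the data/<namespace>/<table>/<region> layout is taken from the 
log line above, and it assumes the region directory actually lives under 
hbase.rootdir in that same layout.

```java
/** Hypothetical illustration of the path mismatch; not HBase code. */
class RecoveredEditsPathSketch {

    // Builds <base>/data/<namespace>/<table>/<encodedRegion>, the layout
    // visible in the WALSplitter log line. A trailing slash on base is dropped.
    public static String regionDir(String baseDir, String namespace,
                                   String table, String encodedRegion) {
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/data/" + namespace + "/" + table + "/" + encodedRegion;
    }

    public static void main(String[] args) {
        // Values copied from the report above.
        String walDir = "hdfs://mycluster/walontest";
        String rootDir = "wasb://xxxxx@yyyyy/hbase";
        String region = "b7fd7db5694eb71190955292b3ff7648";

        // Where the splitter checks for the region (resolved against the WAL dir).
        String whereSplitterLooks = regionDir(walDir, "default", "tb1", region);
        // Where the region is assumed to live (resolved against the root dir).
        String whereRegionLives = regionDir(rootDir, "default", "tb1", region);

        System.out.println("splitter checks: " + whereSplitterLooks);
        System.out.println("region lives at: " + whereRegionLives);
        System.out.println("same location:   "
                + whereSplitterLooks.equals(whereRegionLives));
    }
}
```

With a custom hbase.wal.dir the two results differ, which matches the "This 
region's directory doesn't exist" message: the directory is being looked up 
under the WAL dir, where region data is not stored in this setup.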

CC: [~zyork], [~yuzhih...@gmail.com] Attached the logs for quick review.

  was:
This is an Azure HDInsight HBase cluster with HDP 2.6 and HBase 
1.1.2.2.6.3.2-14.

By default the underlying data goes to wasb://xxxxx@yyyyy/hbase.
I tried to move the WAL folders to HDFS, which is on the SSD mounted on each 
VM at /mnt.

hbase.wal.dir= hdfs://mycluster/walontest

hbase.wal.dir.perms=700

hbase.rootdir.perms=700

hbase.rootdir=wasb://XYZ@hbaseperf.core.net/hbase

Procedure to reproduce this issue:

1. create a table in hbase shell

2. insert a row in hbase shell

3. reboot the VM which hosts that region

4. scan the table in hbase shell and it is empty

Looking at the region server logs:
{code:java}
2018-06-12 22:08:40,455 INFO  [RS_LOG_REPLAY_OPS-wn2-duohba:16020-0-Writer-1] 
wal.WALSplitter: This region's directory doesn't exist: 
hdfs://mycluster/walontest/data/default/tb1/b7fd7db5694eb71190955292b3ff7648. 
It is very likely that it was already split so it's safe to discard those edits.

{code}
The log split/replay ignored the actual WAL because WALSplitter looks for the 
region directory under the hbase.wal.dir we specified rather than under 
hbase.rootdir.

Looking at the source code in
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitter.java,
WALSplitter uses rootDir, which here is walDir, as the root path for tableDir.

So if we use HBASE-17437 and the WAL dir and the HBase root dir are on 
different paths, or even on different filesystems, then #5 using walDir as the 
tableDir root is clearly wrong.

CC: [~zyork], [~yuzhih...@gmail.com] Attached the logs for quick review.


> Custom hbase.wal.dir results in dataloss because we write recovered edits 
> into a different place than where the recovering region server looks for them.
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-20723
>                 URL: https://issues.apache.org/jira/browse/HBASE-20723
>             Project: HBase
>          Issue Type: Bug
>          Components: Recovery, wal
>    Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 2.0.0
>            Reporter: Rohan Pednekar
>            Assignee: Ted Yu
>            Priority: Critical
>         Attachments: 20723.v1.txt, 20723.v2.txt, 20723.v3.txt, 20723.v4.txt, 
> 20723.v5.txt, 20723.v5.txt, 20723.v6.txt, 20723.v7.txt, 20723.v8.txt, 
> 20723.v9.txt, logs.zip
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
