Wellington Chevreuil created HBASE-23694:
--------------------------------------------

             Summary: After RegionProcedureStore completes migration of 
WALProcedureStore, still running WALProcedureStore.syncThread keeps trying to 
delete now inexistent log files.  
                 Key: HBASE-23694
                 URL: https://issues.apache.org/jira/browse/HBASE-23694
             Project: HBase
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


With the introduction of the new procedure store to save procs on a table, 
instead of proc WALs, we added some migration logic to enable a smooth upgrade 
of clusters under previous versions that used to have proc wal files as the 
underlying procedure store.

While running a fresh install with latest master branch version, I had noticed 
that once the old _WALProcedureStore_ directory is detected, it creates and 
start a WPS instance 
[here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L311],
 passing a built-in _ProcedureLoader_ 
[here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L315]
 to perform the conversion from proc wal files to table based.

At the end of this conversion, the builtin loader deletes the proc wal dir 
[here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/procedure2/store/region/RegionProcedureStore.java#L363].
 This may cause a race condition where a running _WALProcedureStore_ internal 
thread may be trying to delete the first proc wal file that may had already 
rolled, then failing with FNFE.

Even worse is that _WALProcedureStore.syncLoop_ keeps trying indefinitely, 
flooding Master logs with messages such as below:
{noformat}
2020-01-14 06:13:23,331 INFO  [master/hbase01:16000:becomeActiveMaster] 
region.RegionProcedureStore: The old WALProcedureStore wal directory 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs exists, migrating...
2020-01-14 06:13:23,361 DEBUG [master/hbase01:16000:becomeActiveMaster] 
wal.WALProcedureStore: Starting WAL Procedure Store lease recovery
2020-01-14 06:13:23,453 INFO  [master/hbase01:16000:becomeActiveMaster] 
wal.WALProcedureStore: Rolled new Procedure Store WAL, id=1
2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] 
wal.WALProcedureStore: Lease acquired for flushLogId=1
2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] 
wal.WALProcedureStore: No state logs to replay.
2020-01-14 06:13:23,454 INFO  [master/hbase01:16000:becomeActiveMaster] 
region.RegionProcedureStore: The WALProcedureStore max pid is 0, and the max 
pid of all loaded procedures is -1
2020-01-14 06:13:23,455 INFO  [master/hbase01:16000:becomeActiveMaster] 
region.RegionProcedureStore: Migration of WALProcedureStore finished
2020-01-14 06:13:23,455 INFO  [master/hbase01:16000:becomeActiveMaster] 
procedure2.ProcedureExecutor: Recovered RegionProcedureStore lease in 784 msec
2020-01-14 06:13:23,471 TRACE [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: no active procedures
2020-01-14 06:13:23,516 INFO  [master/hbase01:16000:becomeActiveMaster] 
procedure2.ProcedureExecutor: Loaded RegionProcedureStore in 61 msec
2020-01-14 06:13:23,516 INFO  [master/hbase01:16000:becomeActiveMaster] 
procedure2.RemoteProcedureDispatcher: Instantiated, coreThreads=128 
(allowCoreThreadTimeOut=true), queueMaxSize=32, operationDelay=150
2020-01-14 06:13:23,628 INFO  [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: Rolled new Procedure Store WAL, id=2
2020-01-14 06:13:23,634 INFO  [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: Remove all state logs with ID less than 1, since no 
active procedures
2020-01-14 06:13:23,634 TRACE [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: Removing 
log=file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
2020-01-14 06:13:23,634 INFO  [WALProcedureStoreSyncThread] 
wal.ProcedureWALFile: Archiving 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
 to 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/oldWALs/pv2-00000000000000000001.log
2020-01-14 06:13:23,634 ERROR [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: Unable to remove log: 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
java.io.FileNotFoundException: File 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
 does not exist
...
2020-01-14 06:13:23,635 ERROR [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: Unable to remove log: 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
java.io.FileNotFoundException: File 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
 does not exist
        at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635)
...
2020-01-14 06:13:37,352 ERROR [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: Unable to remove log: 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
java.io.FileNotFoundException: File 
file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
 does not exist
        at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635)
{noformat}
I believe a simple solution in this case is to explicitly call 
_WALProcedureStore.stop_ method at the end of the migration, which stops 
_WALProcedureStore_ internal _syncLoop_ thread. 

Am opening a PR soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to