[ https://issues.apache.org/jira/browse/HBASE-20842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gavin updated HBASE-20842:
--------------------------
    Comment: was deleted

(was: A comment with security level 'jira-users' was removed.)

> Infinite loop when replaying remote wals
> ----------------------------------------
>
>                 Key: HBASE-20842
>                 URL: https://issues.apache.org/jira/browse/HBASE-20842
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Duo Zhang
>            Assignee: Guanghao Zhang
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HBASE-20842.master.001.patch, 
> HBASE-20842.master.002.patch, HBASE-20842.master.002.patch, 
> HBASE-20842.master.002.patch
>
>
> {noformat}
> 2018-07-03 12:25:11,375 WARN  [RSProcedureDispatcher-pool13-t19] 
> replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
> asf916.gq1.ygridcore.net,33811,1530620636539 is not online
>       at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:285)
>       at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:276)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> 2018-07-03 12:25:11,440 DEBUG [Thread-2883] 
> replication.TestSyncReplicationStandbyKillRS(111): Server 
> [asf916.gq1.ygridcore.net,33811,1530620636539] marked as dead, waiting for it 
> to finish dead processing
> 2018-07-03 12:25:11,441 DEBUG [Thread-2883] 
> replication.TestSyncReplicationStandbyKillRS(114): Server 
> [asf916.gq1.ygridcore.net,33811,1530620636539] still being processed, waiting
> 2018-07-03 12:25:11,456 WARN  [RS:3;asf916:45751] wal.AbstractFSWAL(419): 
> 'hbase.regionserver.maxlogs' was deprecated.
> 2018-07-03 12:25:11,457 INFO  [RS:3;asf916:45751] wal.AbstractFSWAL(424): WAL 
> configuration: blocksize=256 MB, rollsize=128 MB, 
> prefix=asf916.gq1.ygridcore.net%2C45751%2C1530620709275, suffix=, 
> logDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/WALs/asf916.gq1.ygridcore.net,45751,1530620709275,
>  
> archiveDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/oldWALs
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-4] 
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping 
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, 
> datanodeId = 
> DatanodeInfoWithStorage[127.0.0.1:38997,DS-6002160d-388b-4840-8538-e4c2255108be,DISK]
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-5] 
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping 
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, 
> datanodeId = 
> DatanodeInfoWithStorage[127.0.0.1:45904,DS-e189e3c8-a1bd-475c-86c0-3891e541fc6e,DISK]
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-3] 
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping 
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, 
> datanodeId = 
> DatanodeInfoWithStorage[127.0.0.1:39589,DS-62ced3f8-35c4-4904-80cc-4d514b8f4050,DISK]
> 2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor(887): Stored pid=30, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
> 2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0] 
> assignment.AssignmentManager(1321): 
> Added=asf916.gq1.ygridcore.net,33811,1530620636539 to dead servers, submitted 
> shutdown handler to be executed meta=true
> 2018-07-03 12:25:11,498 INFO  [PEWorker-6] 
> procedure.ServerCrashProcedure(118): Start pid=30, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
> 2018-07-03 12:25:11,500 WARN  [RegionServerTracker-0] 
> replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
> org.apache.hadoop.hbase.DoNotRetryIOException: server not online 
> asf916.gq1.ygridcore.net,33811,1530620636539
>       at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:130)
>       at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:60)
>       at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:380)
>       at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:193)
>       at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:143)
>       at 
> org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:610)
>       at 
> org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:160)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> 2018-07-03 12:25:11,503 WARN  [PEWorker-4] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN  [PEWorker-4] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN  [PEWorker-4] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN  [PEWorker-11] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN  [PEWorker-8] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN  [PEWorker-8] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN  [PEWorker-8] 
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
> operation for replay wals 
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
>  on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
> because the server is already dead, retry
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
