[ https://issues.apache.org/jira/browse/HBASE-20842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gavin updated HBASE-20842:
--------------------------
    Comment: was deleted

(was: A comment with security level 'jira-users' was removed.)

> Infinite loop when replaying remote wals
> ----------------------------------------
>
> Key: HBASE-20842
> URL: https://issues.apache.org/jira/browse/HBASE-20842
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Duo Zhang
> Assignee: Guanghao Zhang
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-20842.master.001.patch,
> HBASE-20842.master.002.patch, HBASE-20842.master.002.patch,
> HBASE-20842.master.002.patch
>
>
> {noformat}
> 2018-07-03 12:25:11,375 WARN [RSProcedureDispatcher-pool13-t19]
> replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server
> asf916.gq1.ygridcore.net,33811,1530620636539 is not online
>     at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:285)
>     at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:276)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-07-03 12:25:11,440 DEBUG [Thread-2883]
> replication.TestSyncReplicationStandbyKillRS(111): Server
> [asf916.gq1.ygridcore.net,33811,1530620636539] marked as dead, waiting for it
> to finish dead processing
> 2018-07-03 12:25:11,441 DEBUG [Thread-2883]
> replication.TestSyncReplicationStandbyKillRS(114): Server
> [asf916.gq1.ygridcore.net,33811,1530620636539] still being processed, waiting
> 2018-07-03 12:25:11,456 WARN [RS:3;asf916:45751] wal.AbstractFSWAL(419):
> 'hbase.regionserver.maxlogs' was deprecated.
> 2018-07-03 12:25:11,457 INFO [RS:3;asf916:45751] wal.AbstractFSWAL(424): WAL
> configuration: blocksize=256 MB, rollsize=128 MB,
> prefix=asf916.gq1.ygridcore.net%2C45751%2C1530620709275, suffix=,
> logDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/WALs/asf916.gq1.ygridcore.net,45751,1530620709275,
> archiveDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/oldWALs
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-4]
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1,
> datanodeId =
> DatanodeInfoWithStorage[127.0.0.1:38997,DS-6002160d-388b-4840-8538-e4c2255108be,DISK]
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-5]
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1,
> datanodeId =
> DatanodeInfoWithStorage[127.0.0.1:45904,DS-e189e3c8-a1bd-475c-86c0-3891e541fc6e,DISK]
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-3]
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1,
> datanodeId =
> DatanodeInfoWithStorage[127.0.0.1:39589,DS-62ced3f8-35c4-4904-80cc-4d514b8f4050,DISK]
> 2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0]
> procedure2.ProcedureExecutor(887): Stored pid=30,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
> 2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0]
> assignment.AssignmentManager(1321):
> Added=asf916.gq1.ygridcore.net,33811,1530620636539 to dead servers, submitted
> shutdown handler to be executed meta=true
> 2018-07-03 12:25:11,498 INFO [PEWorker-6]
> procedure.ServerCrashProcedure(118): Start pid=30,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
> 2018-07-03 12:25:11,500 WARN [RegionServerTracker-0]
> replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
> org.apache.hadoop.hbase.DoNotRetryIOException: server not online
> asf916.gq1.ygridcore.net,33811,1530620636539
>     at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:130)
>     at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:60)
>     at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:380)
>     at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:193)
>     at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:143)
>     at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:610)
>     at org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:160)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-07-03 12:25:11,503 WARN [PEWorker-4]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN [PEWorker-4]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN [PEWorker-4]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-11]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-8]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-8]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-8]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)