[ 
https://issues.apache.org/jira/browse/HBASE-20829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531442#comment-16531442
 ] 

Duo Zhang commented on HBASE-20829:
-----------------------------------

[~zghaobac] FYI. This seems to be a problem with replaying remote WALs...
{noformat}
2018-07-03 12:25:11,375 WARN  [RSProcedureDispatcher-pool13-t19] 
replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
asf916.gq1.ygridcore.net,33811,1530620636539 is not online
        at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:285)
        at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:276)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-07-03 12:25:11,440 DEBUG [Thread-2883] 
replication.TestSyncReplicationStandbyKillRS(111): Server 
[asf916.gq1.ygridcore.net,33811,1530620636539] marked as dead, waiting for it 
to finish dead processing
2018-07-03 12:25:11,441 DEBUG [Thread-2883] 
replication.TestSyncReplicationStandbyKillRS(114): Server 
[asf916.gq1.ygridcore.net,33811,1530620636539] still being processed, waiting
2018-07-03 12:25:11,456 WARN  [RS:3;asf916:45751] wal.AbstractFSWAL(419): 
'hbase.regionserver.maxlogs' was deprecated.
2018-07-03 12:25:11,457 INFO  [RS:3;asf916:45751] wal.AbstractFSWAL(424): WAL 
configuration: blocksize=256 MB, rollsize=128 MB, 
prefix=asf916.gq1.ygridcore.net%2C45751%2C1530620709275, suffix=, 
logDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/WALs/asf916.gq1.ygridcore.net,45751,1530620709275,
 
archiveDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/oldWALs
2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-4] 
asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping 
handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, datanodeId 
= 
DatanodeInfoWithStorage[127.0.0.1:38997,DS-6002160d-388b-4840-8538-e4c2255108be,DISK]
2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-5] 
asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping 
handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, datanodeId 
= 
DatanodeInfoWithStorage[127.0.0.1:45904,DS-e189e3c8-a1bd-475c-86c0-3891e541fc6e,DISK]
2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-3] 
asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping 
handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, datanodeId 
= 
DatanodeInfoWithStorage[127.0.0.1:39589,DS-62ced3f8-35c4-4904-80cc-4d514b8f4050,DISK]
2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor(887): Stored pid=30, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0] 
assignment.AssignmentManager(1321): 
Added=asf916.gq1.ygridcore.net,33811,1530620636539 to dead servers, submitted 
shutdown handler to be executed meta=true
2018-07-03 12:25:11,498 INFO  [PEWorker-6] procedure.ServerCrashProcedure(118): 
Start pid=30, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
2018-07-03 12:25:11,500 WARN  [RegionServerTracker-0] 
replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
org.apache.hadoop.hbase.DoNotRetryIOException: server not online 
asf916.gq1.ygridcore.net,33811,1530620636539
        at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:130)
        at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:60)
        at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:380)
        at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:193)
        at 
org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:143)
        at 
org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:610)
        at 
org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:160)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-07-03 12:25:11,503 WARN  [PEWorker-4] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,503 WARN  [PEWorker-4] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,503 WARN  [PEWorker-4] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,503 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,504 WARN  [PEWorker-7] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,505 WARN  [PEWorker-11] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,505 WARN  [PEWorker-8] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,505 WARN  [PEWorker-8] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
2018-07-03 12:25:11,505 WARN  [PEWorker-8] 
replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote 
operation for replay wals 
[remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
 on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually 
because the server is already dead, retry
{noformat}
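
In the log above, the "Can not add remote operation ... retry" warning is printed 15 times between 12:25:11,503 and 12:25:11,505, always for the same dead server. For illustration only, here is a minimal sketch of one common way to avoid that kind of tight retry loop: a bounded retry with capped exponential backoff. This is not HBase code and not necessarily how this issue was or should be fixed; the class {{RetrySketch}}, its methods and all parameters are hypothetical names made up for the example.

```java
import java.util.function.BooleanSupplier;

public class RetrySketch {

    /** Capped exponential backoff: baseMillis * 2^attempt, but never more than maxMillis. */
    public static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the shift so that small base values cannot overflow a long.
        int shift = Math.min(attempt, 20);
        return Math.min(maxMillis, baseMillis << shift);
    }

    /**
     * Calls tryDispatch up to maxAttempts times, sleeping between failed attempts.
     * Returns the number of attempts used on success, or -1 if every attempt
     * failed or the thread was interrupted while waiting.
     */
    public static int dispatchWithBackoff(BooleanSupplier tryDispatch, int maxAttempts,
                                          long baseMillis, long maxMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (tryDispatch.getAsBoolean()) {
                return attempt + 1;
            }
            // Do not sleep after the final failed attempt.
            if (attempt + 1 < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical dispatch that fails twice and then succeeds.
        int[] calls = {0};
        int used = dispatchWithBackoff(() -> ++calls[0] >= 3, 5, 1, 8);
        System.out.println("succeeded after " + used + " attempts");
    }
}
```

In a real fix the caller would presumably also pick a different, live region server rather than retrying the same one; the sketch only shows the pacing of the retries.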

> Remove the addFront assertion in MasterProcedureScheduler.doAdd
> ---------------------------------------------------------------
>
>                 Key: HBASE-20829
>                 URL: https://issues.apache.org/jira/browse/HBASE-20829
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 3.0.0, 2.1.0, 2.2.0
>
>         Attachments: HBASE-20829-debug.patch, HBASE-20829-v1.patch, 
> HBASE-20829.patch, 
> org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS-output.txt
>
>
> Timed out.
> {noformat}
> 2018-06-30 01:32:33,823 ERROR [Time-limited test] 
> replication.TestSyncReplicationStandbyKillRS(93): Failed to transit standby 
> cluster to DOWNGRADE_ACTIVE
> {noformat}
> We failed to transit the state to DA (DOWNGRADE_ACTIVE) but then kept waiting 
> for it to become DA, so the test hung there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
