[ https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764104#comment-16764104 ]
Duo Zhang commented on HBASE-21862: ----------------------------------- And there is no timeout for opening a region either. You can not do this otherwise there will be double assign. The intention here is that, if the rs is alive then it will finally report the open result back, otherwise it must be dead, or at least it will abort it self. Then the SCP will take care the reassign. > region can be assigned to 2 servers due to a timed-out call > ----------------------------------------------------------- > > Key: HBASE-21862 > URL: https://issues.apache.org/jira/browse/HBASE-21862 > Project: HBase > Issue Type: Bug > Affects Versions: 3.0.0, 2.2.0 > Reporter: Sergey Shelukhin > Priority: Blocker > > It's a classic bug, sort of... the call times out to open the region, but RS > actually processes it alright. It could also happen if the response didn't > make it back due to a network issue. > As a result region is opened on two servers. > There are some mitigations possible to narrow down the race window. > 1) Don't process expired open calls, fail them. Won't help for network issues. > 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that > will require fixing other network races where master kills RS, which would > require adding state versioning to the protocol. > The fundamental fix though would require either > 1) an unknown failure from open to ascertain the state of the region from the > server. Again, this would probably require protocol changes to make sure we > ascertain the region is not opened, and also that the > already-failed-on-master open is NOT going to be processed if it's some queue > or even in transit on the network (via a nonce-like mechanism)? > 2) some form of a distributed lock per region, e.g. in ZK > 3) some form of 2PC? but the participant list cannot be determined in a > manner that's both scalable and guaranteed correct. Theoretically it could be > all RSes. > {noformat} > 2019-02-08 03:21:31,715 INFO [PEWorker-7] > procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, > state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; > TransitRegionStateProcedure table=table, > region=d0214809147e43dc6870005742d5d204, ASSIGN > 2019-02-08 03:21:31,758 INFO [PEWorker-7] > assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, > state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; > TransitRegionStateProcedure table=table, > region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, > location=server1,17020,1549567999303; forceNewPlan=false, retain=true > 2019-02-08 03:21:31,984 INFO [PEWorker-13] assignment.RegionStateStore: > pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, > regionState=OPENING, regionLocation=server1,17020,1549623714617 > 2019-02-08 03:22:32,552 WARN [RSProcedureDispatcher-pool4-t3451] > assignment.RegionRemoteProcedureBase: The remote operation pid=260637, > ppid=260626, state=RUNNABLE, hasLock=false; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... > to server server1,17020,1549623714617 failed > java.io.IOException: Call to server1/...:17020 failed on local exception: > org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, > waitTime=60145, rpcTimeout=60000^M > at > org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M > ... > Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, > waitTime=60145, rpcTimeout=60000^M > at > org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M > ... 4 more^M > {noformat} > RS: > {noformat} > hbase-regionserver.log:2019-02-08 03:22:41,131 INFO > [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: > Open ...d0214809147e43dc6870005742d5d204. > ... > hbase-regionserver.log:2019-02-08 03:25:44,751 INFO > [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: > Opened ...d0214809147e43dc6870005742d5d204. > {noformat} > Retry: > {noformat} > 2019-02-08 03:22:32,967 INFO [PEWorker-6] > assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; > pid=260626, ppid=260595, > state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; > TransitRegionStateProcedure table=table, > region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, > location=server1,17020,1549623714617 > 2019-02-08 03:22:33,084 INFO [PEWorker-6] > assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, > state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; > TransitRegionStateProcedure table=table, > region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null; > forceNewPlan=true, retain=false > 2019-02-08 03:22:33,238 INFO [PEWorker-7] assignment.RegionStateStore: > pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, > regionState=OPENING, regionLocation=server2,17020,1549569075319 > {noformat} > The ignore-message > {noformat} > 2019-02-08 03:25:44,754 WARN > [RpcServer.default.FPBQ.Fifo.handler=34,queue=4,port=17000] > assignment.TransitRegionStateProcedure: Received report OPENED transition > from server1,17020,1549623714617 for rit=OPENING, > location=server2,17020,1549569075319, table=table, > region=d0214809147e43dc6870005742d5d204, pid=260626 but the region is not on > it, should be a retry, ignore > {noformat} > The 2nd assignment > {noformat} > 2019-02-08 03:26:18,915 INFO [PEWorker-7] procedure2.ProcedureExecutor: > Finished subprocedure(s) of pid=260626, ppid=260595, > state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; > TransitRegionStateProcedure table=table, > region=d0214809147e43dc6870005742d5d204, ASSIGN; resume parent processing. > 2019-02-08 03:26:18,971 INFO [PEWorker-11] procedure2.ProcedureExecutor: > Finished pid=260626, ppid=260595, state=SUCCESS, hasLock=false; > TransitRegionStateProcedure table=table, > region=d0214809147e43dc6870005742d5d204, ASSIGN in 4mins, 47.347sec > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)