[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-11-28 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702540#comment-16702540
 ] 

Josh Elser commented on HBASE-20671:


Thanks Allan!!

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-11-11 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683151#comment-16683151
 ] 

Allan Yang commented on HBASE-20671:


[~elserj], FYI, HBASE-21395 is a work around for this issue in branch-2.0 and 
branch-2.1, in branch-2+, RTSP should reslove this kind of problem.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-10-26 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665266#comment-16665266
 ] 

Josh Elser commented on HBASE-20671:


Sweet! I'll have to keep an eye out for it too with that info. Thanks Allan! 
Super helpful.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-10-25 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664537#comment-16664537
 ] 

Allan Yang commented on HBASE-20671:


This one is valid still, I have seen similar issue like split parent was 
brought online causing the referenced file being compacted. Will come back 
later.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-21 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587995#comment-16587995
 ] 

Josh Elser commented on HBASE-20671:


(late to the party)

Thanks for the info, [~taklwu]. Feel free to ping if you need help – will try 
to give you some review cycles when you get there, otherwise.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-21 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587964#comment-16587964
 ] 

stack commented on HBASE-20671:
---

On the tail of HBASE-21032, looks like an experience similar to yours [~taklwu]

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-20 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586456#comment-16586456
 ] 

stack commented on HBASE-20671:
---

Thank you [~taklwu] Keep us posted. Fine to fill in commentary here since seems 
like what you are seeing is related.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-18 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584712#comment-16584712
 ] 

Hadoop QA commented on HBASE-20671:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue}  0m  
3s{color} | {color:blue} The patch file was not named according to hbase's 
naming conventions. Please see 
https://yetus.apache.org/documentation/0.7.0/precommit-patchnames for 
instructions. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} HBASE-20671 does not apply to master. Rebase required? Wrong 
Branch? See https://yetus.apache.org/documentation/0.7.0/precommit-patchnames 
for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-20671 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12926808/workaround.txt |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14085/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-18 Thread Tak Lon (Stephen) Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584709#comment-16584709
 ] 

Tak Lon (Stephen) Wu commented on HBASE-20671:
--

hi guys, I am not 100% sure yet but I recently worked on using 
{{hbase.readonly}} to be true on hbase-2.1.0 for a read replica cluster that 
the {{hbase:namespace}} cannot be assigned (infinite loop when 
{{isTableAssigned}} is checking for {{hbase:namespace}} table but return false) 
during the read replica cluster startup.

I found the patch of HBASE-20702 has skipped `empty` rows but seems like rows 
for system table(s) e.g. {{hbase:namespace}} should not be considered as empty. 
I made my band-aid change below and the cluster resumed to be started. 
{noformat}
  private void loadMeta() throws IOException {
// TODO: use a thread pool
regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
  @Override
  public void visitRegionState(Result result, final RegionInfo regionInfo, 
final State state,
  final ServerName regionLocation, final ServerName lastHost, final 
long openSeqNum) {
if (!regionInfo.getTable().equals(TableName.NAMESPACE_TABLE_NAME)) { // 
<-- added to unblock the read replica cluster
  if (state == null && regionLocation == null && lastHost == null
  && openSeqNum == SequenceId.NO_SEQUENCE_ID) {
// This is a row with nothing in it.
LOG.warn("Skipping empty row={}", result);
return;
  }
}
{noformat}

so, do you guys think I should fix it in other place?

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-14 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580383#comment-16580383
 ] 

Josh Elser commented on HBASE-20671:


I haven't seen it recently, but I haven't been keeping a super-close eye on 
this stuff like I was previously. Whatever you choose to track it is fine to me.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-13 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579011#comment-16579011
 ] 

stack commented on HBASE-20671:
---

Downing priority and moving out of 2.0.2. Josh hasn't seen this recently. We 
have put in a bit of a workaround too. Leaving open in case we get a fresh 
instance.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-01 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565745#comment-16565745
 ] 

Josh Elser commented on HBASE-20671:


I haven't seen it recently. Not sure if it's "fixed" by something else or not. 
Need to sync up with tests runs that have been happening over here..

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.2
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-08-01 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565723#comment-16565723
 ] 

stack commented on HBASE-20671:
---

We still critical on this one given we committed workaround in sub-issue? I 
don't see this in my test runs but I don't have merges in the mix. I should add 
them?

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.2
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-12 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509736#comment-16509736
 ] 

Josh Elser commented on HBASE-20671:


bq. Will I commit this in a subissue as a workaround?

I think I already saw you do this, but +1 :)

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504193#comment-16504193
 ] 

stack commented on HBASE-20671:
---

Will I commit this in a subissue as a workaround?

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504182#comment-16504182
 ] 

Hadoop QA commented on HBASE-20671:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue}  0m  
2s{color} | {color:blue} The patch file was not named according to hbase's 
naming conventions. Please see 
https://yetus.apache.org/documentation/0.7.0/precommit-patchnames for 
instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 8s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
41s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
33s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 53s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}109m 
11s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
19s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}149m 49s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-20671 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12926808/workaround.txt |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux bd079b4d7ac9 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 
13:48:03 UTC 2016 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git revision | master / b002b0b8b0 |
| maven | version: Apache Maven 3.5.3 
(3383c37e1f9e9b3bc3df5050c29c8aff9f295297; 2018-02-24T19:49:05Z) |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC3 |
|  Test Results | 
htt

[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504155#comment-16504155
 ] 

Josh Elser commented on HBASE-20671:


+1 to your extra debugging. That will pay for itself in spades!

I was going to suggest that you include the {{regionInfo}} in the log message, 
but I see that it will already be there from RegionStateStore.

Let me get QA going.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504033#comment-16504033
 ] 

stack commented on HBASE-20671:
---

[~elserj] I think we should commit workaround.txt. It will not load an empty 
row as is happening here. It will also dump out all about the row should this 
scenario ever arise again which hopefully will give us more clue as to what is 
going on.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503962#comment-16503962
 ] 

stack commented on HBASE-20671:
---

Ok. Assigning to you then. I'm interested. This is a good one so shout if more 
info to figure. Thanks.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503926#comment-16503926
 ] 

Josh Elser commented on HBASE-20671:


Thanks, sir. I can try to take this back and do more digging on my end with the 
great insight you've given already.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503922#comment-16503922
 ] 

stack commented on HBASE-20671:
---

[~elserj attached my test if it helps...

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503919#comment-16503919
 ] 

stack commented on HBASE-20671:
---

bq. I'm very suspect of something in the region replication codepath (if that's 
what you are thinking by "errant writer"). I want to try to repro this in a 
junit test now (if you haven't already).

I made a test where I turned off the hbase:meta GC'er, merged, and then killed 
master. We don't assign out any merge parent. We don't find a ghost line in the 
hbase:meta when we load hbase:meta when new Master comes up.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503917#comment-16503917
 ] 

Josh Elser commented on HBASE-20671:


{quote}Let me make sure there weren't any other tests that were being run 
against the same cluster.. (they should all be disabled internally).
{quote}
Confirmed. No rogue hbck'ing happening.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503916#comment-16503916
 ] 

stack commented on HBASE-20671:
---

07 crashed before the master did... and we had started a recovery.

2018-05-30 04:27:10,012 INFO  [main-EventThread] master.RegionServerTracker: 
RegionServer ephemeral node deleted, processing expiration 
[ctr-e138-1518143905142-336066-01-07.hwx.site,16020,1527652019629]

Claims no WALs for this server:

2018-05-30 04:27:10,131 INFO  [PEWorker-15] master.SplitLogManager: 
hdfs://mycluster/apps/hbase/data/WALs/ctr-e138-1518143905142-336066-01-07.hwx.site,16020,1527652019629-splitting
 is empty dir, no logs to split

... which is hard to believe. Would be good to check the 07 side if this 
true.


> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503900#comment-16503900
 ] 

Josh Elser commented on HBASE-20671:


{quote}or an errant writer? An hbck or something?
{quote}
I know this exact test doesn't do any HBCK'ing. Let me make sure there weren't 
any other tests that were being run against the same cluster.. (they should all 
be disabled internally).

I'm very suspect of something in the region replication codepath (if that's 
what you are thinking by "errant writer"). I want to try to repro this in a 
junit test now (if you haven't already).

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503890#comment-16503890
 ] 

stack commented on HBASE-20671:
---

Thanks [~elserj]

The hbase:meta listing is interesting. The timestamps are behind the timestamp 
of the delete of the family done by:

2018-05-30 04:26:00,634 DEBUG [PEWorker-9] hbase.META: Delete 
{"totalColumns":1,"row":"tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.","families":{"info":[

So... unlikely it HBASE-20065. Skipped hbase:meta recovery (not too likely but 
perhaps) or an errant writer? An hbck or something?

We assign meta to 07 at 2018-05-30 04:01:39,120 , about twenty minutes 
before the fun starts.






> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503830#comment-16503830
 ] 

Josh Elser commented on HBASE-20671:


{noformat}
create 'tabletwo_merge', 'f1', {SPLITS => ['1', '2', '3']}, {REGION_REPLICATION 
=> 3}{noformat}
I also have the output of meta before and after the merge of the two regions 
done:

Before
{noformat}
tabletwo_merge column=table:state, timestamp=1527652131284, value=\x08\x00
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:regioninfo, timestamp=1527654100907, value={ENCODED => 
a7dd6606dcacc9daf085fc9fa2aecc0c, NAME => 
'tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.', STARTKEY => 
'', ENDKEY => '1'}
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:seqnumDuringOpen, timestamp=1527654100907, 
value=\x00\x00\x00\x00\x00\x00\x00\x09
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:seqnumDuringOpen_0001, timestamp=1527652131228, 
value=\x00\x00\x00\x00\x00\x00\x00\x02
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:seqnumDuringOpen_0002, timestamp=1527652131231, 
value=\x00\x00\x00\x00\x00\x00\x00\x02
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:server, timestamp=1527654100907, 
value=ctr-e138-1518143905142-336066-01-05.hwx.site:16020
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:server_0001, timestamp=1527652131228, 
value=ctr-e138-1518143905142-336066-01-06.hwx.site:16020
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:server_0002, timestamp=1527652131231, 
value=ctr-e138-1518143905142-336066-01-07.hwx.site:16020
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:serverstartcode, timestamp=1527654100907, value=1527651997433
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:serverstartcode_0001, timestamp=1527652131228, 
value=\x00\x00\x01c\xAF'\xB8\xBD
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:serverstartcode_0002, timestamp=1527652131231, 
value=\x00\x00\x01c\xAF'\xED\xAD
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. column=info:sn, 
timestamp=1527654100695, 
value=ctr-e138-1518143905142-336066-01-05.hwx.site,16020,1527651997433
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:sn_0001, timestamp=1527652131017, 
value=ctr-e138-1518143905142-336066-01-06.hwx.site,16020,1527652006077
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:sn_0002, timestamp=1527652131017, 
value=ctr-e138-1518143905142-336066-01-07.hwx.site,16020,1527652019629
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:state, timestamp=1527654100907, value=OPEN
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:state_0001, timestamp=1527652131228, value=OPEN
tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c. 
column=info:state_0002, timestamp=1527652131231, value=OPEN
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:regioninfo, timestamp=1527652131250, value={ENCODED => 
4017a3c778551d4d258c785d455f9c0b, NAME => 
'tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b.', STARTKEY => 
'1', ENDKEY => '2'}
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:seqnumDuringOpen, timestamp=1527652131250, 
value=\x00\x00\x00\x00\x00\x00\x00\x02
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:seqnumDuringOpen_0001, timestamp=1527652131213, 
value=\x00\x00\x00\x00\x00\x00\x00\x02
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:seqnumDuringOpen_0002, timestamp=1527652131222, 
value=\x00\x00\x00\x00\x00\x00\x00\x02
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:server, timestamp=1527652131250, 
value=ctr-e138-1518143905142-336066-01-05.hwx.site:16020
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:server_0001, timestamp=1527652131213, 
value=ctr-e138-1518143905142-336066-01-02.hwx.site:16020
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:server_0002, timestamp=1527652131222, 
value=ctr-e138-1518143905142-336066-01-03.hwx.site:16020
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:serverstartcode, timestamp=1527652131250, value=1527651997433
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:serverstartcode_0001, timestamp=1527652131213, 
value=\x00\x00\x01c\xAF'F#
tabletwo_merge,1,1527652130538.4017a3c778551d4d258c785d455f9c0b. 
column=info:serverstartcode_0002, timestamp=1527652131222, 
value=\x00\x00\x01c\xAF'r?
tabletwo_merge,1,1527652130538.4017a3c77855

[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503817#comment-16503817
 ] 

Josh Elser commented on HBASE-20671:


{quote}Did this cluster have HBASE-20065?
{quote}
Yeah, it should've been remarkably close to branch-2.0 (around the time I 
filed, of course)
{quote}Was it running w/ read replicas?
{quote}
Actually, it was! I missed that detail the first time. The test was doing some 
shell commands before we say this server-side. Digging into that side more now. 
I can pull some more logs out.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-06 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503728#comment-16503728
 ] 

stack commented on HBASE-20671:
---

This is an odd one.

We deleted the merge parent row info column family well before the crash:

2018-05-30 04:26:00,634 DEBUG [PEWorker-9] hbase.META: Delete 
{"totalColumns":1,"row":"tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.","families":{"info":[{"qualifier":"","vlen":0,"tag":[],"timestamp":1527654360633}]},"ts":9223372036854775807}

The row is supposed to be GONE from hbase:meta at this point.

... Yet post-crash, three minutes later, we find that at least the 
info;regioninfo column is still present (otherwise we'd skip the row in 
hbase:meta reload):

2018-05-30 04:29:20,263 INFO  
[master/ctr-e138-1518143905142-336066-01-03:2] 
assignment.RegionStateStore: Load hbase:meta entry 
region=a7dd6606dcacc9daf085fc9fa2aecc0c, regionState=null, lastHost=null, 
regionLocation=null, seqnum=-1
2018-05-30 04:29:20,263 INFO  
[master/ctr-e138-1518143905142-336066-01-03:2] 
assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
regionState=null; presuming OFFLINE
2018-05-30 04:29:20,263 INFO  
[master/ctr-e138-1518143905142-336066-01-03:2] assignment.RegionStates: 
Added to offline, CURRENTLY NEVER CLEARED!!! rit=OFFLINE, location=null, 
table=tabletwo_merge, region=a7dd6606dcacc9daf085fc9fa2aecc0c

Did this cluster have HBASE-20065? Was it running w/ read replicas?

The RS log is for the hbase:meta host after master comes back but it was not 
carrying hbase:meta during interesting time. Do we have the log from that 
server? I see that on master start, we open hbase:meta but we do not replay any 
logs. Perhaps the previous host did clean shutdown? If it did not, perhaps this 
is where we went awry.. we are skipping recovery of hbase:meta?

I ask about HBASE-20065 because it is about master setting timestamp on edits 
rather than letting it to the RS. It does set ts in all places so suspicious.

I ask about read replicas because perhaps they are updating state on a 
non-existent region? I don't know enough about how these work.


> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> 

[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-05 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502635#comment-16502635
 ] 

stack commented on HBASE-20671:
---

Yeah. Good one.  Problematic code is in the load of meta after a master crash:

{code}
State localState = state;
if (localState == null) {
  // No region state column data in hbase:meta table! Are I doing a 
rolling upgrade from
  // hbase1 to hbase2? Am I restoring a SNAPSHOT or otherwise adding a 
region to hbase:meta?
  // In any of these cases, state is empty. For now, presume OFFLINE 
but there are probably
  // cases where we need to probe more to be sure this correct; TODO 
informed by experience.
  LOG.info(regionInfo.getEncodedName() + " regionState=null; presuming 
" + State.OFFLINE);
  localState = State.OFFLINE;
}
{code]

Above note allows that there are cases where the presumption that the region is 
OFFLINE is wrong (per HBASE-19529).

Let me write a test for this one to repro. We need a FOR_GC state to which we 
set Regions that are up for deletion.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-04 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500810#comment-16500810
 ] 

Josh Elser commented on HBASE-20671:


I would not turn down your help, [~stack]. I'm happy to try to help out if you 
think it's something I can pick up easily enough.

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-01 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498661#comment-16498661
 ] 

stack commented on HBASE-20671:
---

Seems related to HBASE-19529. I can take it if you want [~elserj]


> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

2018-06-01 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498572#comment-16498572
 ] 

Josh Elser commented on HBASE-20671:


Broad FYI for folks I know looking at similar things already: [~stack], 
[~Apache9], [~yuzhih...@gmail.com]

It seems to me that, while replaying the pv2 WAL, we need to somehow determine 
when a region has been deleted. I'd guess that this means ripping through the 
entire log before taking action, but I'm not sure if that's practical... Maybe 
I need to try to write a test first?

> Merged region brought back to life causing RS to be killed by Master
> 
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.0.0
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Attachments: 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-03.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-02.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=2] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-03:2] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=2] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-02.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)