[jira] [Updated] (HBASE-22081) master shutdown: close RpcServer and procWAL first thing
[ https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-22081:
------------------------------
    Status: Open  (was: Patch Available)

> master shutdown: close RpcServer and procWAL first thing
> --------------------------------------------------------
>
>                 Key: HBASE-22081
>                 URL: https://issues.apache.org/jira/browse/HBASE-22081
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HBASE-22081.01.patch, HBASE-22081.02.patch,
>                      HBASE-22081.03.patch, HBASE-22081.patch
>
>
> I had a master get stuck due to HBASE-22079 and noticed it was logging RS
> abort messages during shutdown.
> [~bahramch] found some issues where messages are processed by the old master
> during shutdown due to a race condition in the RS cache (it could also happen
> due to a network race).
> Previously I found a bug where an SCP was created during master shutdown with
> incorrect state (because some structures had already been cleaned up).
> I think that before master fencing is implemented we can at least make these
> issues much less likely by thinking about shutdown order.
> 1) First kill the RPC server so we don't receive any more messages. There's no
> need to receive messages when we are shutting down. Server heartbeats could
> be impacted I guess, but I don't think they will be, because we currently only
> kill an RS on ZK timeout.
> 2) Then do whatever cleanup we think is needed that requires the proc WAL.
> 3) Then close the proc WAL so no errant threads can create more procs.
> 4) Then do whatever other cleanup.
> 5) Finally delete the znode.
> Right now the znode is deleted somewhat early I think, and RpcServer is closed
> very late.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
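The five-step order proposed in the description can be sketched roughly as below. This is only an illustration and not the attached patch: the class `ShutdownOrderSketch`, the `Step` enum, and the methods `acceptRequest`, `submitProcedure` and `shutdown` are all hypothetical names and do not correspond to HBase APIs. The sketch shows just the ordering property: the RPC server stops first, the proc WAL closes before the remaining cleanup, and the znode is deleted last.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: none of these names are HBase classes or methods.
class ShutdownOrderSketch {
    // Hypothetical step names mirroring points 1) to 5) of the description.
    enum Step {
        STOP_RPC_SERVER,
        CLEANUP_NEEDING_PROC_WAL,
        CLOSE_PROC_WAL,
        OTHER_CLEANUP,
        DELETE_ZNODE
    }

    private final List<Step> executed = new ArrayList<>();
    private boolean rpcOpen = true;
    private boolean procWalOpen = true;

    // Stand-in for an incoming RPC, e.g. a region server report.
    boolean acceptRequest() {
        return rpcOpen;
    }

    // Stand-in for a thread trying to create a new procedure.
    boolean submitProcedure() {
        return procWalOpen;
    }

    // Runs the steps in declaration order and records each one.
    List<Step> shutdown() {
        for (Step step : Step.values()) {
            switch (step) {
                case STOP_RPC_SERVER:
                    rpcOpen = false;
                    break;
                case CLOSE_PROC_WAL:
                    procWalOpen = false;
                    break;
                default:
                    // Cleanup and znode deletion are placeholders in this sketch.
                    break;
            }
            executed.add(step);
        }
        return executed;
    }

    public static void main(String[] args) {
        ShutdownOrderSketch master = new ShutdownOrderSketch();
        System.out.println(master.shutdown());
    }
}
```

Once `shutdown()` has run, both `acceptRequest()` and `submitProcedure()` return false, which stands in for the goal stated in the description: no late messages are processed and no errant thread creates a procedure while the remaining cleanup runs.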
[jira] [Updated] (HBASE-22081) master shutdown: close RpcServer and procWAL first thing
[ https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HBASE-22081:
-------------------------------------
    Attachment: HBASE-22081.03.patch
[jira] [Updated] (HBASE-22081) master shutdown: close RpcServer and procWAL first thing
[ https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HBASE-22081:
-------------------------------------
    Attachment: HBASE-22081.02.patch
[jira] [Updated] (HBASE-22081) master shutdown: close RpcServer and procWAL first thing
[ https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HBASE-22081:
-------------------------------------
    Attachment: HBASE-22081.01.patch
[jira] [Updated] (HBASE-22081) master shutdown: close RpcServer and procWAL first thing
[ https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HBASE-22081:
-------------------------------------
    Summary: master shutdown: close RpcServer and procWAL first thing  (was: master shutdown: close RpcServer first thing, close procWAL as soon as viable, and delete znode the last thing)