[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId" and "enable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leizhang updated HBASE-24781:
-----------------------------
    Summary: [Replication] When execute shell cmd "disable_peer peerId" and "enable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue  (was: [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue)

> [Replication] When execute shell cmd "disable_peer peerId" and "enable_peer
> peerId",the master web UI show a wrong number of SizeOfLogQueue
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-24781
>                 URL: https://issues.apache.org/jira/browse/HBASE-24781
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.2.5
>            Reporter: leizhang
>            Priority: Major
>
> Suppose we have a peer with id 1. When we execute the shell commands disable_peer '1' and enable_peer '1',
> the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such disable_peer/enable_peer
> operations it grows to 11, and it never drops back to 1.
> I can see that ReplicationSourceManager.refreshSources(peerId) is called; it terminates the previous
> replication source and creates a new one. I also found the note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value. I think it is a sub-issue of HBASE-23231.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
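As an illustration of the fix direction suggested in this thread (passing true when executing terminate()), the sketch below applies that idea to the block quoted above, so the replaced source clears its metrics, including SizeOfLogQueue, before the refreshed source re-registers the current WAL. This is only a sketch of the proposal, not a confirmed or tested fix; whether clearing is safe at this point is exactly what the existing // Do not clear metrics comment guards.

{code:java}
// Sketch only: ReplicationSourceManager.refreshSources() with the reporter's
// suggestion applied, i.e. clear the old source's metrics when replacing it.
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Pass true for the clearMetrics flag so SizeOfLogQueue is reset here
  // (proposed change; the current code deliberately passes false).
  toRemove.terminate(terminateMessage, null, true);
}
{code}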
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leizhang updated HBASE-24781:
-----------------------------
    Description: 
Suppose we have a peer with id 1. When we execute the shell commands disable_peer '1' and enable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such disable_peer/enable_peer operations it grows to 11, and it never drops back to 1.
I can see that ReplicationSourceManager.refreshSources(peerId) is called; it terminates the previous replication source and creates a new one. I also found the note // Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
This causes the wrong SizeOfLogQueue value. I think it is a sub-issue of HBASE-23231.

  was:
Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' and enable_peer '1', then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) is called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of (HBASE-23231)

> [Replication] When execute shell cmd "disable_peer peerId",the master web UI
> show a wrong number of SizeOfLogQueue
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-24781
>                 URL: https://issues.apache.org/jira/browse/HBASE-24781
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.2.5
>            Reporter: leizhang
>            Priority: Major
>
> Suppose we have a peer with id 1. When we execute the shell commands disable_peer '1' and enable_peer '1',
> the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such disable_peer/enable_peer
> operations it grows to 11, and it never drops back to 1.
> I can see that ReplicationSourceManager.refreshSources(peerId) is called; it terminates the previous
> replication source and creates a new one. I also found the note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value. I think it is a sub-issue of HBASE-23231.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' and enable_peer '1', then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) is called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of (HBASE-23231) was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) is called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of (HBASE-23231) > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' and enable_peer '1', then i can see the SizeOfLogQueue > metric of all regionservers +1 , after 10 times disable_peer ops , it > will increase to 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) is > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block: > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of > (HBASE-23231) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) is called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of (HBASE-23231) was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of (HBASE-23231) > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) is > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block: > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of > (HBASE-23231) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Comment: was deleted (was: h1. ReplicationSource do not update metrics after refresh) > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block: > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of > (HBASE-23231) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of (HBASE-23231) was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maybe we should set true when execute terminate() ? > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block: > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, i think it's a sub issue of > (HBASE-23231) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166934#comment-17166934 ]

leizhang commented on HBASE-24781:
----------------------------------

h1. ReplicationSource does not update metrics after refresh

> [Replication] When execute shell cmd "disable_peer peerId",the master web UI
> show a wrong number of SizeOfLogQueue
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-24781
>                 URL: https://issues.apache.org/jira/browse/HBASE-24781
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.2.5
>            Reporter: leizhang
>            Priority: Major
>
> Suppose we have a peer with id 1. When we execute the shell command disable_peer '1',
> the SizeOfLogQueue metric of every regionserver increases by 1; after 10 disable_peer
> operations it grows to 11, and it never drops back to 1.
> I can see that ReplicationSourceManager.refreshSources(peerId) is called; it terminates the previous
> replication source and creates a new one. I also found the note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when executing terminate()?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, mabe we should set true when execute terminate() ? was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block: > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, mabe we should set true when > execute terminate() ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, maby we should set true when > execute terminate() ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maybe we should set true when execute terminate() ? was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, mabe we should set true when execute terminate() ? > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block: > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, maybe we should set true when > execute terminate() ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, maby we should set true when > execute terminate() ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block: {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note //Do not clear metrics in the bellow code block {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note //Do not clear metrics in the bellow code block: > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, maby we should set true when > execute terminate() ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will terminate the previous replication source and create a new one. and i found the note {code:java} ReplicationSourceInterface toRemove = this.sources.put(peerId, src); if (toRemove != null) { LOG.info("Terminate replication source for " + toRemove.getPeerId()); // Do not clear metrics toRemove.terminate(terminateMessage, null, false); } {code} this cause the wrong number of sizeOfLogQueue, maby we should set true when execute terminate() ? was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to add a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue metric twice ? thx > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will terminate the previous replication source and create a new > one. and i found the note > {code:java} > ReplicationSourceInterface toRemove = this.sources.put(peerId, src); > if (toRemove != null) { > LOG.info("Terminate replication source for " + toRemove.getPeerId()); > // Do not clear metrics > toRemove.terminate(terminateMessage, null, false); > } > {code} > this cause the wrong number of sizeOfLogQueue, maby we should set true when > execute terminate() ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to add a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue metric twice ? thx was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to add a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue twice ? thx > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will enqueue the current wals to the source , maybe when the > current wal is already in the replication queue , we try to add a duplicated > wal to the source ,which cause the same wal increase the SizeOfLogQueue > metric twice ? thx > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Summary: [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue (was: when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue) > [Replication] When execute shell cmd "disable_peer peerId",the master web UI > show a wrong number of SizeOfLogQueue > --- > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will enqueue the current wals to the source , maybe when the > current wal is already in the replication queue , we try to add a duplicated > wal to the source ,which cause the same wal increase the SizeOfLogQueue > twice ? thx > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to add a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue twice ? thx was: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to and a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue twice ? thx > when execute shell cmd "disable_peer peerId",the master web UI show a wrong > number of SizeOfLogQueue > - > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will enqueue the current wals to the source , maybe when the > current wal is already in the replication queue , we try to add a duplicated > wal to the source ,which cause the same wal increase the SizeOfLogQueue > twice ? thx > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to and a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue twice ? thx was: Supposed that we have an source peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to and a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue twice ? thx > when execute shell cmd "disable_peer peerId",the master web UI show a wrong > number of SizeOfLogQueue > - > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will enqueue the current wals to the source , maybe when the > current wal is already in the replication queue , we try to and a duplicated > wal to the source ,which cause the same wal increase the SizeOfLogQueue > twice ? thx > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
[ https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-24781: - Description: Supposed that we have an source peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to and a duplicated wal to the source ,which cause the same wal increase the SizeOfLogQueue twice ? thx was: Supposed that we have an source peer with id 1, when execute shell cmd disable_peer '1' , then i can see the SizeOfLogQueue metric of all regionservers +1 , after 10 times disable_peer ops , it will increase to 11, and it will never decrease to 1 in fulture . I can see the function ReplicationSourceManager.refreshSources(peerId) called , it will enqueue the current wals to the source , maybe when the current wal is already in the replication queue , we try to and a duplicated wal to the source ,and cause the same wal increase the SizeOfLogQueue twice ? thx > when execute shell cmd "disable_peer peerId",the master web UI show a wrong > number of SizeOfLogQueue > - > > Key: HBASE-24781 > URL: https://issues.apache.org/jira/browse/HBASE-24781 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.2.5 >Reporter: leizhang >Priority: Major > > Supposed that we have an source peer with id 1, when execute shell cmd > disable_peer '1' , then i can see the SizeOfLogQueue metric of all > regionservers +1 , after 10 times disable_peer ops , it will increase to > 11, and it will never decrease to 1 in fulture . > I can see the function ReplicationSourceManager.refreshSources(peerId) > called , it will enqueue the current wals to the source , maybe when the > current wal is already in the replication queue , we try to and a duplicated > wal to the source ,which cause the same wal increase the SizeOfLogQueue > twice ? thx > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
leizhang created HBASE-24781:
---------------------------------

             Summary: when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue
                 Key: HBASE-24781
                 URL: https://issues.apache.org/jira/browse/HBASE-24781
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 2.2.5
            Reporter: leizhang

Suppose we have a source peer with id 1. When we execute the shell command disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 disable_peer operations it grows to 11, and it never drops back to 1.

I can see that ReplicationSourceManager.refreshSources(peerId) is called; it enqueues the current WALs to the source. Maybe, when the current WAL is already in the replication queue, we add a duplicate WAL to the source, which makes the same WAL increase SizeOfLogQueue twice? Thanks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leizhang updated HBASE-22620:
-----------------------------
    Comment: was deleted

(was: I think this problem still exist in Hbase2.x , when i use Hbase2.2.5 , I encounter the same problem .)

> When a cluster open replication,regionserver will not clean up the walLog
> references on zk due to no wal entry need to be replicated
> --------------------------------------------------------------------------
>
>                 Key: HBASE-22620
>                 URL: https://issues.apache.org/jira/browse/HBASE-22620
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.1.0, 1.4.8, 1.4.9, 2.2.5
>            Reporter: leizhang
>            Assignee: yaojingyi
>            Priority: Major
>         Attachments: HBASE-22620.branch-1.4.001.patch
>
> When I enabled the replication feature on my HBase cluster (20 regionserver nodes) and added a peer
> cluster, I created, for example, a table with 3 regions and REPLICATION_SCOPE set to 1, which opened
> on 3 of the 20 regionservers. Because there was no data (entryBatch) to replicate, the remaining 17
> nodes accumulated lots of WAL references under the zk node "/hbase/replication/rs/\{regionserver}/\{peerId}/"
> that were never cleaned up, which meant lots of WAL files on HDFS were never cleaned up either. When I
> checked my test cluster after about four months, it had accumulated about 50,000 WAL files in the
> oldWALs directory on HDFS. The source code shows that the useless-WAL check runs only when there is
> data to replicate: only after some data has been replicated by the source endpoint are the stale
> references on zk removed and the useless WAL files on HDFS cleaned up normally. So I think: do we need
> another way to trigger the useless-WAL cleaning job in a replication cluster? Maybe in the replication
> progress report scheduled task (just like ReplicationStatisticsTask.class).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
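The last sentence of the description suggests triggering the cleanup from a periodic task, similar to ReplicationStatisticsTask. A minimal, self-contained sketch of that idea follows; the WalRefCleanupSketch class, the WalRefCleaner interface and its method are hypothetical placeholders standing in for the existing cleanOldLogs(...) logic, not real HBase APIs.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WalRefCleanupSketch {

  /** Hypothetical hook that a real patch would wire to the source's cleanOldLogs(...) logic. */
  public interface WalRefCleaner {
    void cleanStaleWalReferences();
  }

  /**
   * Runs the cleanup on a fixed cadence so regionservers that never ship any
   * entries still drop their stale WAL references on zk eventually.
   */
  public static ScheduledExecutorService schedule(WalRefCleaner cleaner, long periodSeconds) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    pool.scheduleAtFixedRate(cleaner::cleanStaleWalReferences,
        periodSeconds, periodSeconds, TimeUnit.SECONDS);
    return pool;
  }
}
{code}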
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166100#comment-17166100 ]

leizhang commented on HBASE-22620:
----------------------------------

I think this problem still exists in HBase 2.x; when I use HBase 2.2.5, I encounter the same problem.

> When a cluster open replication,regionserver will not clean up the walLog
> references on zk due to no wal entry need to be replicated
> --------------------------------------------------------------------------
>
>                 Key: HBASE-22620
>                 URL: https://issues.apache.org/jira/browse/HBASE-22620
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.1.0, 1.4.8, 1.4.9, 2.2.5
>            Reporter: leizhang
>            Assignee: yaojingyi
>            Priority: Major
>         Attachments: HBASE-22620.branch-1.4.001.patch
>
> When I enabled the replication feature on my HBase cluster (20 regionserver nodes) and added a peer
> cluster, I created, for example, a table with 3 regions and REPLICATION_SCOPE set to 1, which opened
> on 3 of the 20 regionservers. Because there was no data (entryBatch) to replicate, the remaining 17
> nodes accumulated lots of WAL references under the zk node "/hbase/replication/rs/\{regionserver}/\{peerId}/"
> that were never cleaned up, which meant lots of WAL files on HDFS were never cleaned up either. When I
> checked my test cluster after about four months, it had accumulated about 50,000 WAL files in the
> oldWALs directory on HDFS. The source code shows that the useless-WAL check runs only when there is
> data to replicate: only after some data has been replicated by the source endpoint are the stale
> references on zk removed and the useless WAL files on HDFS cleaned up normally. So I think: do we need
> another way to trigger the useless-WAL cleaning job in a replication cluster? Maybe in the replication
> progress report scheduled task (just like ReplicationStatisticsTask.class).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Affects Version/s: (was: 1.2.4) 2.2.5 > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.1.0, 1.4.8, 1.4.9, 2.2.5 >Reporter: leizhang >Assignee: yaojingyi >Priority: Major > Attachments: HBASE-22620.branch-1.4.001.patch > > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876588#comment-16876588 ]

leizhang commented on HBASE-22620:
----------------------------------

{code:java}
// Workaround tried locally: when the poll returns no entries, trigger the
// old-WAL cleanup directly so idle sources still release their zk references.
WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
if (entryBatch == null) {
  manager.cleanOldLogs(this.getCurrentPath().getName(), peerClusterZnode,
      this.replicationQueueInfo.isQueueRecovered());
  continue;
}
shipEdits(entryBatch);
{code}
Thank you very much! At present I clean up the old-log zk references by calling cleanOldLogs whenever the entryBatch is empty, and it does work, but I am not sure whether this is the right place for the logic. Looking forward to your patch.

> When a cluster open replication,regionserver will not clean up the walLog
> references on zk due to no wal entry need to be replicated
> --------------------------------------------------------------------------
>
>                 Key: HBASE-22620
>                 URL: https://issues.apache.org/jira/browse/HBASE-22620
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.2.4, 2.1.0, 1.4.8, 1.4.9
>            Reporter: leizhang
>            Assignee: yaojingyi
>            Priority: Major
>
> When I enabled the replication feature on my HBase cluster (20 regionserver nodes) and added a peer
> cluster, I created, for example, a table with 3 regions and REPLICATION_SCOPE set to 1, which opened
> on 3 of the 20 regionservers. Because there was no data (entryBatch) to replicate, the remaining 17
> nodes accumulated lots of WAL references under the zk node "/hbase/replication/rs/\{regionserver}/\{peerId}/"
> that were never cleaned up, which meant lots of WAL files on HDFS were never cleaned up either. When I
> checked my test cluster after about four months, it had accumulated about 50,000 WAL files in the
> oldWALs directory on HDFS. The source code shows that the useless-WAL check runs only when there is
> data to replicate: only after some data has been replicated by the source endpoint are the stale
> references on zk removed and the useless WAL files on HDFS cleaned up normally. So I think: do we need
> another way to trigger the useless-WAL cleaning job in a replication cluster? Maybe in the replication
> progress report scheduled task (just like ReplicationStatisticsTask.class).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Affects Version/s: 2.1.0 1.4.8 > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 2.1.0, 1.4.8, 1.4.9 >Reporter: leizhang >Assignee: yaojingyi >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Fix Version/s: (was: 2.1.0) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874710#comment-16874710 ] leizhang commented on HBASE-22620: -- It is not only the pressure on ZK. Consider the case where a large amount of data is written by non-replicated tables: the hlog is then not empty, and hlogs will still accumulate on HDFS. This is what caused the hlogs under the HDFS directory /oldWALs to reach about 30TB, so -HBASE-20206- may not be helpful for this issue. I looked at the code of HBase 2.1.0 and the logic is:
{code:java}
WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
if (entryBatch == null) {
  // since there is no logs need to replicate, we refresh the ageOfLastShippedOp
  source.getSourceMetrics().setAgeOfLastShippedOp(EnvironmentEdgeManager.currentTime(), walGroupId);
  continue;
}
// the NO_MORE_DATA instance has no path so do not call shipEdits
if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
  noMoreData();
} else {
  shipEdits(entryBatch);
}
{code}
The entryReader.take() of HBase 1.4.9 has indeed been replaced by entryReader.poll(getEntriesTimeout), so the thread is no longer blocked. But if entryBatch is null, it only updates the age metric and the loop continues; the shipEdits() method is still not called. Could you show me where the logic is that handles the old hlogs when the entry batch is null? Thank you.
> When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 1.4.9 >Reporter: leizhang >Priority: Major > Fix For: 2.1.0 > > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
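Purely as illustration of the question above (and of the cleanOldLogs workaround posted earlier in this thread), here is a minimal, self-contained sketch of a shipper-style loop in which an empty poll also prunes references to WALs older than the reader's current position. It is a standalone simulation with made-up names (entryBatches, walRefs, currentWal), not the actual ReplicationSource code:
{code:java}
// Standalone simulation, not HBase code: when the poll times out with no batch,
// the loop also drops references to WALs that are already fully read. The
// walRefs deque stands in for the znodes under
// /hbase/replication/rs/{regionserver}/{peerId}/ and is purely illustrative.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class IdleCleanupSketch {
  public static void main(String[] args) throws InterruptedException {
    LinkedBlockingQueue<List<String>> entryBatches = new LinkedBlockingQueue<>();
    Deque<String> walRefs = new ArrayDeque<>(List.of("wal.1", "wal.2", "wal.3"));
    String currentWal = "wal.3"; // the WAL the reader is currently positioned on

    for (int i = 0; i < 3; i++) {
      List<String> batch = entryBatches.poll(1, TimeUnit.SECONDS);
      if (batch == null) {
        // Nothing to ship; still safe to drop references to WALs that are
        // strictly older than the one the reader is currently positioned on.
        while (!walRefs.isEmpty() && !walRefs.peekFirst().equals(currentWal)) {
          System.out.println("pruning stale WAL ref " + walRefs.pollFirst());
        }
        continue;
      }
      System.out.println("shipping " + batch.size() + " entries");
    }
    System.out.println("remaining refs: " + walRefs);
  }
}
{code}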
[jira] [Reopened] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang reopened HBASE-22620: -- > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 1.4.9 >Reporter: leizhang >Priority: Major > Fix For: 2.1.0 > > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Comment: was deleted (was: thank you very much ! I check the source code of Hbase2.1.0 ,and find the entryReader.take() has been replaced by entryReader.poll(getEntriesTimeout); then the tread will not be blocked and will excute the following logic, and the problem can be solved !) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 1.4.9 >Reporter: leizhang >Priority: Major > Fix For: 2.1.0 > > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang resolved HBASE-22620. -- Resolution: Fixed Fix Version/s: 2.1.0 > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 1.4.9 >Reporter: leizhang >Priority: Major > Fix For: 2.1.0 > > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874063#comment-16874063 ] leizhang commented on HBASE-22620: -- Thank you very much! I checked the source code of HBase 2.1.0 and found that entryReader.take() has been replaced by entryReader.poll(getEntriesTimeout); the thread will then no longer be blocked and will execute the following logic, so the problem can be solved! > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Affects Version/s: (was: 2.0.3) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Affects Version/s: 1.2.4 > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 1.2.4, 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873771#comment-16873771 ] leizhang commented on HBASE-22620: -- Yesterday, I found that HBase 1.2.4 also has the same problem. > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873771#comment-16873771 ] leizhang edited comment on HBASE-22620 at 6/27/19 2:41 AM: --- Yesterday, when I checked our production cluster, I found that HBase 1.2.4 also has the same problem. was (Author: zl_cn_hbase): Yesterday, I found the Hbase 1.2.4 also exists the same problem > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873770#comment-16873770 ] leizhang commented on HBASE-22620: -- Sorry, I only checked the code in HBase 2.x and found it is the same as 1.4.9; actually I have not done any practical work on HBase 2.x. > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Description: When I open the replication feature on my hbase cluster (20 regionserver nodes) and added a peer cluster, for example, I create a table with 3 regions with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of wal references on the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned up, which cause lots of wal file on hdfs will not be cleaned up either. When I check my test cluster after about four months, it accumulates about 5w wal files in the oldWal directory on hdfs. The source code shows that only there are data to be replicated, and after some data is replicated in the source endpoint, then it will executed the useless wal file check, and clean their references on zk, and the hdfs useless wal files will be cleaned up normally. So I think do we need other method to trigger the useless wal cleaning job in a replication cluster? May be in the replication progress report schedule task (just like ReplicationStatisticsTask.class) (was: When I open the replication feature on my hbase cluster (20 regionserver nodes), for example, I create a table with 3 regions, which opened on 3 regionservers of 20. Due to no data to replicate ,the left 17 nodes accumulate lots of wal references on the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned up, which cause lots of wal file on hdfs will not be cleaned up either. When I check my test cluster after about four months, it accumulates about 5w wal files in the oldWal directory on hdfs. The source code shows that only there are data to be replicated, and after some data is replicated in the source endpoint, then it will executed the useless wal file check, and clean their references on zk, and the hdfs useless wal files will be cleaned up normally. So I think do we need other method to trigger the useless wal cleaning job in a replication cluster? May be in the replication progress report schedule task (just like ReplicationStatisticsTask.class)) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes) and added a peer cluster, for example, I create a table with 3 regions > with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due > to no data(entryBatch) to replicate ,the left 17 nodes accumulate lots of > wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
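To make the proposal at the end of the description concrete, here is a minimal sketch of driving the cleanup from a scheduled task, in the spirit of ReplicationStatisticsTask. The cleanOldWalReferences() callback and the 5-minute cadence are assumptions for illustration, not an existing HBase API:
{code:java}
// Sketch of the idea proposed above: drive the old-WAL-reference cleanup from a
// periodic task rather than only from the shipping path. Standalone illustration;
// cleanOldWalReferences() is a hypothetical callback, not an existing HBase method.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicWalCleanupSketch {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // Similar cadence to the replication statistics reporting task; the scheduler
    // keeps running until it is shut down.
    scheduler.scheduleAtFixedRate(
        PeriodicWalCleanupSketch::cleanOldWalReferences, 5, 5, TimeUnit.MINUTES);
  }

  // Placeholder for "check which queued WALs are already fully read and drop
  // their znode references", which today only happens after edits are shipped.
  static void cleanOldWalReferences() {
    System.out.println("checking replication queues for fully-read WALs...");
  }
}
{code}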
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Issue Type: Bug (was: Improvement) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873305#comment-16873305 ] leizhang commented on HBASE-22620: -- Did you expect one just like this? HBase version 1.4.9:
{code:java}
"main-EventThread.replicationSource,1.replicationSource.xx.hbase.lq2%2C16020%2C1561379323483,1" #153306 daemon prio=5 os_prio=0 tid=0x7f0844681800 nid=0xe49 waiting on condition [0x7ef573a1d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x7f05ba84c3d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.take(ReplicationSourceWALReaderThread.java:227)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:550)

"main.replicationSource,1-EventThread" #153305 daemon prio=5 os_prio=0 tid=0x7f0844765800 nid=0xe48 waiting on condition [0x7ef57391c000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x7f05ba847f50> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
{code}
Once the peer was added, if there were no entries to replicate, the hlog refs accumulated on ZK and hlogs accumulated in /oldWALs. I also found that, due to the huge data amount (about 30TB of hlog files under /hbase/oldWALs), when I execute the command "remove_peer 'peer1'" on my cluster, the master shows the log below and all regionservers abort:
{code:java}
// code placeholder
ERROR [B.defaultRpcServer.handler=172,queue=22,port=16000] master.MasterRpcServices: Region server ,16020,1503477315622 reported a fatal error:
ABORTING region server xxx,16020,1503477315622: Failed to delete queue (queueId=peer1)
Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:672)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1671)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1413)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1280)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeQueue(ReplicationQueuesZKImpl.java:93)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.deleteSource(ReplicationSourceManager.java:298)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:579)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.peerRemoved(ReplicationSourceManager.java:590)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:628)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
{code}
So I have to remove the hlog refs on ZK manually and let the regionservers clean up hlogs normally.
> When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17
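For reference, a rough sketch of the manual cleanup described above, assuming direct ZooKeeper access: it removes the queued WAL reference znodes one at a time rather than in one large recursive multi (which is what hit the ConnectionLoss above). The quorum address, peer id and regionserver znode name are placeholders; treat this as an illustration, not a supported procedure, and run it only once the peer is removed and the WALs are definitely no longer needed:
{code:java}
// Sketch of the manual cleanup: delete the accumulated WAL reference znodes one
// by one instead of relying on a single recursive multi. All paths below are
// illustrative placeholders.
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ManualQueueCleanup {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });
    String queuePath = "/hbase/replication/rs/rs-example,16020,1500000000000/peer1";
    List<String> walRefs = zk.getChildren(queuePath, false);
    for (String wal : walRefs) {
      zk.delete(queuePath + "/" + wal, -1);   // one delete per znode, no big multi
    }
    zk.delete(queuePath, -1);                  // finally remove the (now empty) queue node
    zk.close();
  }
}
{code}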
[jira] [Comment Edited] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870944#comment-16870944 ] leizhang edited comment on HBASE-22620 at 6/26/19 8:12 AM: --- I reviewed the code that creates the entryBatch, as follows:
{code:java}
ReplicationSourceWALReaderThread.class -> run() -> entryStream.hasNext() -> tryAdvanceEntry() -> checkReader() -> openNextLog() -> readNextEntryAndSetPosition()
{code}
When it reaches the end of a wal file, it switches to the next hlog, so the current log position update logic is correct, and all the hlogs are visited by the replication source endpoint and switched correctly no matter whether there are entries to replicate; but the ZK refs cleanup job may be blocked by the blocking queue's take() method. Consider a cluster with replication configured (hbase.replication set to true and a peer added) but where no table enables the replication property: all the logs of the cluster will be kept.
was (Author: zl_cn_hbase): I view the code that create the entryBatch,just as follows : {code:java} ReplicationSourceWALReaderThread.class -> run() -> entryStream.hasNext() -> tryAdvanceEntry() ->checkReader()->openNextLog()->readNextEntryAndSetPosition() {code} when reaching the end of a wal file ,it will switch to the next hlog, so the current log position update logic is correct,and all the hlog are available by replication source endpoint and switch correctly no matter whether we have entry to replicate, but the zk refs cleaning up job may be blocked due to the blocking queue's take() method. Imagine a cluster configured replication(set hbase.replication to true) but all tables don't open the replication property, all the log of the cluster will be keep.
> When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871929#comment-16871929 ] leizhang commented on HBASE-22620: -- I printed the stack info of my regionserver and found the thread blocked here:
{code:java}
"regionserver/hostxxx/ipxx:16020.replicationSource,1-EventThread" #421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on condition [0x7ef574423000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x7eff45024a28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
{code}
> When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871918#comment-16871918 ] leizhang commented on HBASE-22620: -- "regionserver/x:16020.replicationSource,1-EventThread" #421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on condition [0x7ef574423000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x7eff45024a28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)//blocking here at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Comment: was deleted (was: "regionserver/x:16020.replicationSource,1-EventThread" #421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on condition [0x7ef574423000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x7eff45024a28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)//blocking here at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871917#comment-16871917 ] leizhang commented on HBASE-22620: -- I print the stack info of my regionserver,and find the thread blocking here: {code:java} "regionserver/hbase-zeus-26-242-225.hadoop.lq2/10.26.242.225:16020.replicationSource,1-EventThread" #421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on condition [0x7ef574423000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x7eff45024a28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) //blocking here at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501) {code} > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Comment: was deleted (was: I print the stack info of my regionserver,and find the thread blocking here: {code:java} "regionserver/hbase-zeus-26-242-225.hadoop.lq2/10.26.242.225:16020.replicationSource,1-EventThread" #421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on condition [0x7ef574423000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x7eff45024a28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) //blocking here at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501) {code}) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870983#comment-16870983 ] leizhang commented on HBASE-22620: -- sorry , i mean logPositionAndCleanOldLogs() > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Comment: was deleted (was: Thank you, you can just keep a replication cluster with no data write for some time and then oberserve the zk log refs ,the problem maybe be reproduced like mine.) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870944#comment-16870944 ] leizhang commented on HBASE-22620: -- I reviewed the code that creates the entryBatch, as follows:
{code:java}
ReplicationSourceWALReaderThread.class -> run() -> entryStream.hasNext() -> tryAdvanceEntry() -> checkReader() -> openNextLog() -> readNextEntryAndSetPosition()
{code}
When it reaches the end of a wal file, it switches to the next hlog, so the current log position update logic is correct, and all the hlogs are visited by the replication source endpoint and switched correctly no matter whether there are entries to replicate; but the ZK refs cleanup job may be blocked by the blocking queue's take() method. Imagine a cluster with replication configured (hbase.replication set to true) but where no table enables the replication property: all the logs of the cluster will be kept.
> When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870875#comment-16870875 ] leizhang commented on HBASE-22620: -- Thank you. You can keep a replication cluster with no data writes for some time and then observe the zk log refs; the problem may reproduce just as it did for me. > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870854#comment-16870854 ] leizhang commented on HBASE-22620: -- {code:java} WALEntryBatch entryBatch = entryReader.take(); shipEdits(entryBatch); {code} Looking at the entryReader.take() method: it takes entries from a LinkedBlockingQueue and blocks until the queue has an entry to replicate. When the blocking queue is empty the call blocks, so the shipEdits() method is never executed. > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
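To make the blocking behaviour concrete, here is a small sketch in plain Java; the WalEntryBatch stand-in, the ship()/updatePositionAndCleanOldLogs() method names, and the 10-second poll interval are illustrative assumptions, not HBase API. With take() the loop parks until a batch arrives, so the code that ships edits and cleans old log refs never runs on an idle queue; polling with a timeout is one possible way to let the loop wake up periodically anyway.
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ShipperLoopSketch {

    // Stand-in for a batch of WAL entries; not the real WALEntryBatch class.
    static class WalEntryBatch { }

    private final LinkedBlockingQueue<WalEntryBatch> queue = new LinkedBlockingQueue<>();
    private volatile boolean running = true;

    // Variant A: mirrors the behaviour described in the comment. take() parks the thread
    // until a batch arrives, so nothing after it (shipping, position update, old-log
    // cleanup) ever runs while the queue stays empty.
    void runBlockingVariant() throws InterruptedException {
        while (running) {
            WalEntryBatch batch = queue.take();        // blocks indefinitely on an idle queue
            ship(batch);
            updatePositionAndCleanOldLogs(batch);      // unreachable without traffic
        }
    }

    // Variant B: one possible direction - poll with a timeout so the loop wakes up
    // periodically and can still trigger the cleanup even when there is nothing to ship.
    void runPollingVariant() throws InterruptedException {
        while (running) {
            WalEntryBatch batch = queue.poll(10, TimeUnit.SECONDS);
            if (batch != null) {
                ship(batch);
            }
            updatePositionAndCleanOldLogs(batch);      // hypothetical: also runs when batch == null
        }
    }

    private void ship(WalEntryBatch batch) { /* send the batch to the peer cluster */ }
    private void updatePositionAndCleanOldLogs(WalEntryBatch batch) { /* update zk position, drop old refs */ }
}
{code}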
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870837#comment-16870837 ] leizhang commented on HBASE-22620: -- No, we don't modify the logic in ReplicationSource Endpoint > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870829#comment-16870829 ] leizhang edited comment on HBASE-22620 at 6/24/19 6:12 AM: --- Thank you for the reply. By "no data to replicate" I mean that no WAL entry needs to be replicated from the log queue, because the logic that cleans the old wal refs sits in the shipEdits() method of ReplicationSourceShipperThread.class; part of the code in shipEdits() is as follows: {code:java} WALEntryBatch entryBatch = entryReader.take(); // send the entryBatch to the target cluster shipEdits(entryBatch); {code} Inside shipEdits() the call chain is: {code:java} shipEdits() -> updateLogPosition() -> ReplicationSourceManager.logPositionAndCleanOldLogs(){code} So only when entryReader has an entryBatch to replicate is logPositionAndCleanOldLogs() called and the old-log refs removed from zk normally (under the znode /hbase/replication/rs/\{regionserver}/\{peerId}/). But when there is no entry to replicate (for example, no table region with the replication property enabled is open on regionserver A, even though the cluster has replication enabled), logPositionAndCleanOldLogs() is never triggered on A, the zk refs remain on zk forever, and the real log files on hdfs are not cleaned either. After a long time, with the log-roll mechanism, lots of log files accumulate and cannot be removed normally because of the refs on zk. Consider two situations: 1、a WAL file contains no data; 2、a WAL file contains entries, but they will not be replicated later (the table does not have the replication property enabled, so the entries are skipped). Just as you say, the entire WAL file is still read and the current replicating log-file position can be updated normally, but the old-log ref cleanup logic is never triggered, because there is no entry that needs to be replicated. The actual behaviour on my test cluster also confirms this. was (Author: zl_cn_hbase): Thank you for reply ,no data to replicate, I mean that no entry in wal need to be replicated from the log queue, because I see the logic cleaning hfile refs is in the shipEdits() method in ReplicationSourceShipperThread.class, parts of the code in shipEdits() are as follows: {code:java} WALEntryBatch entryBatch = entryReader.take(); // send the entryBath to the target cluster shipEdits(entryBatch); {code} then in the shipEdits() method,the method call chains are : {code:java} shipEdits() ->updateLogPosition() ->ReplicationSourceManager.logPositionAndCleanOldLogs(){code} I see that only when entryReader has entryBatch to replicate, then the logPositionAndCleanOldLogs() method will be called and the oldLogs refs will be removed from zk normally( under znode /hbase/replication/rs/\{resionserver}/\{peerId}/). but when no entry to be replicated,(for example, there are no table regions that open the replication property on regionserver A ),the logPositionAndCleanOldLogs() will never be triggered on A ,then the zk refs will remain in the zk forerver,the real log file on hdfs will not be cleanerd,either. After a long time, with the log roll mechanism,lots of log files will accumulate, and can't be removed normally due to the ref on zk. 
consider two situations: 1、no data in a wal file 2、there are entries in a wal file,but won't be replicated later(the table doesn't open the replitation property,so the entries will be skip) just as you say, the entire wal file will be also read, and the current replating log file position can be updated normally, but the oldLog fille refs clean up logic will never be triggered, because there are no entry need to replicated. the real phenomenon on my test cluster also valid that. > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
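The comment and the issue description both point toward triggering the stale-ref cleanup from a periodic task, similar to ReplicationStatisticsTask. Below is a rough sketch of that idea; the StaleWalRefCleaner callback is entirely hypothetical and stands in for whatever would re-run the logPositionAndCleanOldLogs()-style bookkeeping independently of replication traffic.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WalRefCleanupChoreSketch {

    /** Hypothetical callback: re-check the replication source's log queue and drop the zk
     *  references of WALs that are already fully read, i.e. the same bookkeeping that
     *  logPositionAndCleanOldLogs() performs today when an entry batch is shipped. */
    public interface StaleWalRefCleaner {
        void cleanStaleWalRefs();
    }

    // Schedules the cleaner the way a chore such as ReplicationStatisticsTask is scheduled,
    // so refs are released even on a regionserver that never ships an entry batch.
    public static ScheduledExecutorService schedule(StaleWalRefCleaner cleaner, long periodSeconds) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(cleaner::cleanStaleWalRefs, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return pool;
    }
}
{code}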
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Summary: When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated (was: When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to replicate) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to be replicated > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to replicate
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Summary: When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to replicate (was: When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no wal entry need to replicate > > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870829#comment-16870829 ] leizhang commented on HBASE-22620: -- Thank you for reply ,no data to replicate, I mean that no entry in wal need to be replicated from the log queue, because I see the logic cleaning hfile refs is in the shipEdits() method in ReplicationSourceShipperThread.class, parts of the code in shipEdits() are as follows: {code:java} WALEntryBatch entryBatch = entryReader.take(); // send the entryBath to the target cluster shipEdits(entryBatch); {code} then in the shipEdits() method,the method call chains are : {code:java} shipEdits() ->updateLogPosition() ->ReplicationSourceManager.logPositionAndCleanOldLogs(){code} I see that only when entryReader has entryBatch to replicate, then the logPositionAndCleanOldLogs() method will be called and the oldLogs refs will be removed from zk normally( under znode /hbase/replication/rs/\{resionserver}/\{peerId}/). but when no entry to be replicated,(for example, there are no table regions that open the replication property on regionserver A ),the logPositionAndCleanOldLogs() will never be triggered on A ,then the zk refs will remain in the zk forerver,the real log file on hdfs will not be cleanerd,either. After a long time, with the log roll mechanism,lots of log files will accumulate, and can't be removed normally due to the ref on zk. consider two situations: 1、no data in a wal file 2、there are entries in a wal file,but won't be replicated later(the table doesn't open the replitation property,so the entries will be skip) just as you say, the entire wal file will be also read, and the current replating log file position can be updated normally, but the oldLog fille refs clean up logic will never be triggered, because there are no entry need to replicated. the real phenomenon on my test cluster also valid that. > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no data to replicate > -- > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Description: When I open the replication feature on my hbase cluster (20 regionserver nodes), for example, I create a table with 3 regions, which opened on 3 regionservers of 20. Due to no data to replicate ,the left 17 nodes accumulate lots of wal references on the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned up, which cause lots of wal file on hdfs will not be cleaned up either. When I check my test cluster after about four months, it accumulates about 5w wal files in the oldWal directory on hdfs. The source code shows that only there are data to be replicated, and after some data is replicated in the source endpoint, then it will executed the useless wal file check, and clean their references on zk, and the hdfs useless wal files will be cleaned up normally. So I think do we need other method to trigger the useless wal cleaning job in a replication cluster? May be in the replication progress report schedule task (just like ReplicationStatisticsTask.class) (was: When I open the replication feature on my hbase cluster (20 regionserver nodes), for example, I create a table with 3 regions, which opened on 3 regionservers of 20. Due to no data to replicate ,the left 17 nodes accumulate lots of wal references on the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned up, which cause lots of wal file on hdfs will not be cleaned up either. When I check my test cluster after about four months, it accumulates about 5w wal files in the oldWal directory on hdfs. The source code shows that only there are data to be replicated, and after some data is replicated in the source endpoint, then it will executed the useless wal file check, and clean their references on zk, and the hdfs useless wal files will be cleaned up normally. So I think do we need other method to trigger the useless wal cleaning job in a replication cluster? May be in the replication progress report schedule task (ReplicationStatisticsTask.class)) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no data to replicate > -- > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? 
May be in the replication progress report schedule > task (just like ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Description: When I open the replication feature on my hbase cluster (20 regionserver nodes), for example, I create a table with 3 regions, which opened on 3 regionservers of 20. Due to no data to replicate ,the left 17 nodes accumulate lots of wal references on the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned up, which cause lots of wal file on hdfs will not be cleaned up either. When I check my test cluster after about four months, it accumulates about 5w wal files in the oldWal directory on hdfs. The source code shows that only there are data to be replicated, and after some data is replicated in the source endpoint, then it will executed the useless wal file check, and clean their references on zk, and the hdfs useless wal files will be cleaned up normally. So I think do we need other method to trigger the useless wal cleaning job in a replication cluster? May be in the replication progress report schedule task (ReplicationStatisticsTask.class) (was: When I open the replication feature on my hbase cluster (20 regionserver nodes), for example, I create a table with 3 regions, which opened on 3 regionservers of 20. Due to no data to replicate in the source cluster ,then the left 17 nodes accumulate lots of wal references on the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned up, which cause lots of wal file on hdfs will not be cleaned up either. When I check my test cluster after about four months, it accumulates about 5w wal files in the oldWal directory on hdfs. The source code shows that only there are data to be replicated, and after some data is replicated in the source endpoint, then it will executed the useless wal file check, and clean their references on zk, and the hdfs useless wal files will be cleaned up normally. So I think do we need other method to trigger the useless wal cleaning job in a replication cluster? May be in the replication progress report schedule task (ReplicationStatisticsTask.class)) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no data to replicate > -- > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate ,the left 17 nodes > accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? 
May be in the replication progress report schedule > task (ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Issue Type: Improvement (was: Bug) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no data to replicate > -- > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Improvement > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate in the source cluster ,then > the left 17 nodes accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (HBASE-22619) Hbase
[ https://issues.apache.org/jira/browse/HBASE-22619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang resolved HBASE-22619. -- Resolution: Not A Bug > Hbase > - > > Key: HBASE-22619 > URL: https://issues.apache.org/jira/browse/HBASE-22619 > Project: HBase > Issue Type: Bug >Reporter: leizhang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate
[ https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leizhang updated HBASE-22620: - Summary: When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate (was: When a cluster open replication,regionserver will not clean the walLog reference on zk due to no data to replicate) > When a cluster open replication,regionserver will not clean up the walLog > references on zk due to no data to replicate > -- > > Key: HBASE-22620 > URL: https://issues.apache.org/jira/browse/HBASE-22620 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.0.3, 1.4.9 >Reporter: leizhang >Priority: Major > > When I open the replication feature on my hbase cluster (20 regionserver > nodes), for example, I create a table with 3 regions, which opened on 3 > regionservers of 20. Due to no data to replicate in the source cluster ,then > the left 17 nodes accumulate lots of wal references on the zk node > "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned > up, which cause lots of wal file on hdfs will not be cleaned up either. When > I check my test cluster after about four months, it accumulates about 5w wal > files in the oldWal directory on hdfs. The source code shows that only there > are data to be replicated, and after some data is replicated in the source > endpoint, then it will executed the useless wal file check, and clean their > references on zk, and the hdfs useless wal files will be cleaned up normally. > So I think do we need other method to trigger the useless wal cleaning job in > a replication cluster? May be in the replication progress report schedule > task (ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-22620) When a cluster open replication,regionserver will not clean the walLog reference on zk due to no data to replicate
leizhang created HBASE-22620: Summary: When a cluster open replication,regionserver will not clean the walLog reference on zk due to no data to replicate Key: HBASE-22620 URL: https://issues.apache.org/jira/browse/HBASE-22620 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 1.4.9, 2.0.3 Reporter: leizhang When I open the replication feature on my hbase cluster (20 regionserver nodes), for example, I create a table with 3 regions, which opened on 3 regionservers of 20. Due to no data to replicate in the source cluster ,then the left 17 nodes accumulate lots of wal references on the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" and will not be cleaned up, which cause lots of wal file on hdfs will not be cleaned up either. When I check my test cluster after about four months, it accumulates about 5w wal files in the oldWal directory on hdfs. The source code shows that only there are data to be replicated, and after some data is replicated in the source endpoint, then it will executed the useless wal file check, and clean their references on zk, and the hdfs useless wal files will be cleaned up normally. So I think do we need other method to trigger the useless wal cleaning job in a replication cluster? May be in the replication progress report schedule task (ReplicationStatisticsTask.class) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
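The accumulation described above can be observed by listing the per-regionserver queue znode with a plain ZooKeeper client. A minimal sketch follows; the quorum address, regionserver name, and peer id are placeholders, and the /hbase prefix assumes the default zookeeper.znode.parent.
{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListWalRefsSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder quorum, regionserver name and peer id - adjust them for a real cluster.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });
        String queueZnode = "/hbase/replication/rs/rs-example.domain,16020,1561000000000/1";
        List<String> walRefs = zk.getChildren(queueZnode, false);
        // On an idle-but-replicating regionserver this list keeps growing with every log roll.
        System.out.println(walRefs.size() + " wal references under " + queueZnode);
        walRefs.forEach(System.out::println);
        zk.close();
    }
}
{code}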
[jira] [Created] (HBASE-22619) Hbase
leizhang created HBASE-22619: Summary: Hbase Key: HBASE-22619 URL: https://issues.apache.org/jira/browse/HBASE-22619 Project: HBase Issue Type: Bug Reporter: leizhang -- This message was sent by Atlassian JIRA (v7.6.3#76005)