[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId" and "enable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-08-18 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Summary: [Replication] When execute shell cmd "disable_peer peerId" and 
"enable_peer peerId",the  master web UI show a wrong number of SizeOfLogQueue  
(was: [Replication] When execute shell cmd "disable_peer peerId",the  master 
web UI show a wrong number of SizeOfLogQueue)

> [Replication] When execute shell cmd "disable_peer peerId" and "enable_peer 
> peerId",the  master web UI show a wrong number of SizeOfLogQueue
> 
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell commands 
> disable_peer '1' and enable_peer '1', the SizeOfLogQueue metric of every 
> regionserver increases by 1; after 10 such operations it grows to 11 and never 
> drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
> HBASE-23231.
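A minimal sketch of the reporter's earlier suggestion in this thread (pass true for the clear-metrics flag when calling terminate()), assuming the third argument of terminate() is that flag, as the existing "Do not clear metrics" comment implies. Whether clearing here is safe for the other metrics that comment protects is the open question; the reporter links it to HBASE-23231.
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Sketch only: pass clearMetrics = true so the old source resets its
  // SizeOfLogQueue contribution before the refreshed source re-registers the
  // current WAL, instead of leaving a stale +1 behind.
  toRemove.terminate(terminateMessage, null, true);
}
{code}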



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-08-18 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell commands disable_peer 
'1' and enable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 
1; after 10 such operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
HBASE-23231.

  was:
  Suppose we have a peer with id 1. When we execute the shell commands disable_peer 
'1' and enable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 
1; after 10 such operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
HBASE-23231.


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell commands 
> disable_peer '1' and enable_peer '1', the SizeOfLogQueue metric of every 
> regionserver increases by 1; after 10 such operations it grows to 11 and never 
> drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
> HBASE-23231.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-08-18 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell commands disable_peer 
'1' and enable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 
1; after 10 such operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
HBASE-23231.

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
HBASE-23231.


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell commands 
> disable_peer '1' and enable_peer '1', the SizeOfLogQueue metric of every 
> regionserver increases by 1; after 10 such operations it grows to 11 and never 
> drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
> HBASE-23231.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-29 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
HBASE-23231.

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
HBASE-23231.


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
> HBASE-23231.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-29 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Comment: was deleted

(was: h1. ReplicationSource does not update metrics after refresh)

> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
> HBASE-23231.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-29 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
HBASE-23231.

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; I think it is a sub-issue of 
> HBASE-23231.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166934#comment-17166934
 ] 

leizhang commented on HBASE-24781:
--

h1. ReplicationSource does not update metrics after refresh

> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
> terminate()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
> terminate()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
> terminate()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
> terminate()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
> terminate()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
> terminate()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
terminates the previous replication source and creates a new one. I found the note 
// Do not clear metrics in the code block below:
{code:java}
ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
if (toRemove != null) {
  LOG.info("Terminate replication source for " + toRemove.getPeerId());
  // Do not clear metrics
  toRemove.terminate(terminateMessage, null, false);
}
{code}
 This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
terminate()?

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> terminates the previous replication source and creates a new one. I found the 
> note // Do not clear metrics in the code block below:
> {code:java}
> ReplicationSourceInterface toRemove = this.sources.put(peerId, src);
> if (toRemove != null) {
>   LOG.info("Terminate replication source for " + toRemove.getPeerId());
>   // Do not clear metrics
>   toRemove.terminate(terminateMessage, null, false);
> }
> {code}
> This causes the wrong SizeOfLogQueue value; maybe we should pass true when calling 
> terminate()?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.
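Purely as an illustration of the guard this paragraph is asking about (the class and field names below are made up for the sketch and are not the actual ReplicationSource/MetricsSource API): only bump the gauge when the WAL was not already queued, so re-enqueueing the current WAL during a refresh cannot count it twice.
{code:java}
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

class WalQueueSketch {
  private final Set<String> queuedWals = new LinkedHashSet<>();
  private final AtomicLong sizeOfLogQueue = new AtomicLong();

  /** Increment SizeOfLogQueue only for WALs that are not already in the queue. */
  void enqueue(String walPath) {
    if (queuedWals.add(walPath)) {
      sizeOfLogQueue.incrementAndGet();
    }
  }
}
{code}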

 

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 


> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> enqueues the current WALs to the new source. Maybe, when the current WAL is 
> already in the replication queue, we add a duplicate WAL to the source, which 
> makes the same WAL increase the SizeOfLogQueue metric twice? Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) [Replication] When execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Summary: [Replication] When execute shell cmd "disable_peer peerId",the  
master web UI show a wrong number of SizeOfLogQueue  (was: when execute shell 
cmd "disable_peer peerId",the  master web UI show a wrong number of 
SizeOfLogQueue)

> [Replication] When execute shell cmd "disable_peer peerId",the  master web UI 
> show a wrong number of SizeOfLogQueue
> ---
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> enqueues the current WALs to the new source. Maybe, when the current WAL is 
> already in the replication queue, we add a duplicate WAL to the source, which 
> makes the same WAL increase the SizeOfLogQueue metric twice? Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 

  was:
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 


> when execute shell cmd "disable_peer peerId",the  master web UI show a wrong 
> number of SizeOfLogQueue
> -
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> enqueues the current WALs to the new source. Maybe, when the current WAL is 
> already in the replication queue, we add a duplicate WAL to the source, which 
> makes the same WAL increase the SizeOfLogQueue metric twice? Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a peer with id 1. When we execute the shell command disable_peer 
'1', the SizeOfLogQueue metric of every regionserver increases by 1; after 10 such 
disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 

  was:
  Suppose we have a source peer with id 1. When we execute the shell command 
disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
after 10 such disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 


> when execute shell cmd "disable_peer peerId",the  master web UI show a wrong 
> number of SizeOfLogQueue
> -
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> enqueues the current WALs to the new source. Maybe, when the current WAL is 
> already in the replication queue, we add a duplicate WAL to the source, which 
> makes the same WAL increase the SizeOfLogQueue metric twice? Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-24781:
-
Description: 
  Suppose we have a source peer with id 1. When we execute the shell command 
disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
after 10 such disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 

  was:
  Suppose we have a source peer with id 1. When we execute the shell command 
disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
after 10 such disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 


> when execute shell cmd "disable_peer peerId",the  master web UI show a wrong 
> number of SizeOfLogQueue
> -
>
> Key: HBASE-24781
> URL: https://issues.apache.org/jira/browse/HBASE-24781
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.2.5
>Reporter: leizhang
>Priority: Major
>
>   Suppose we have a source peer with id 1. When we execute the shell command 
> disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
> after 10 such disable_peer operations it grows to 11 and never drops back to 1.
>   I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
> enqueues the current WALs to the new source. Maybe, when the current WAL is 
> already in the replication queue, we add a duplicate WAL to the source, which 
> makes the same WAL increase the SizeOfLogQueue metric twice? Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24781) when execute shell cmd "disable_peer peerId",the master web UI show a wrong number of SizeOfLogQueue

2020-07-28 Thread leizhang (Jira)
leizhang created HBASE-24781:


 Summary: when execute shell cmd "disable_peer peerId",the  master 
web UI show a wrong number of SizeOfLogQueue
 Key: HBASE-24781
 URL: https://issues.apache.org/jira/browse/HBASE-24781
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 2.2.5
Reporter: leizhang


  Suppose we have a source peer with id 1. When we execute the shell command 
disable_peer '1', the SizeOfLogQueue metric of every regionserver increases by 1; 
after 10 such disable_peer operations it grows to 11 and never drops back to 1.

  I can see that ReplicationSourceManager.refreshSources(peerId) is called; it 
enqueues the current WALs to the new source. Maybe, when the current WAL is already 
in the replication queue, we add a duplicate WAL to the source, which makes the same 
WAL increase the SizeOfLogQueue metric twice? Thanks.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2020-07-27 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Comment: was deleted

(was: I think this problem still exists in HBase 2.x; using HBase 2.2.5, I hit the 
same problem.)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.1.0, 1.4.8, 1.4.9, 2.2.5
>Reporter: leizhang
>Assignee: yaojingyi
>Priority: Major
> Attachments: HBASE-22620.branch-1.4.001.patch
>
>
> When I enabled replication on my HBase cluster (20 regionserver nodes) and added 
> a peer cluster, I created, for example, a table with 3 regions with 
> REPLICATION_SCOPE set to 1, opened on 3 of the 20 regionservers. Because there is 
> no data (entryBatch) to replicate, the remaining 17 nodes accumulate lots of WAL 
> references under the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" 
> that are never cleaned up, so the corresponding WAL files on HDFS are not cleaned 
> up either. When I checked my test cluster after about four months, it had 
> accumulated about 50,000 WAL files in the oldWALs directory on HDFS. The source 
> code shows that only when there is data to replicate, and after some data has 
> been replicated by the source endpoint, does it run the useless-WAL check and 
> clean the references on zk, after which the useless HDFS WAL files are cleaned up 
> normally. So I think: do we need another way to trigger the useless-WAL cleaning 
> job in a replication cluster? Maybe in the replication progress report scheduled 
> task (just like ReplicationStatisticsTask.class).
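A hypothetical sketch of the scheduled trigger suggested above. WalReferenceCleaner and cleanUnreferencedWals() are placeholder names, not existing HBase replication APIs; a real patch would hook into whatever cleanup method the replication source manager exposes (the comment further down this thread calls manager.cleanOldLogs(...) from the shipper loop instead).
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WalReferenceCleanerTask {

  /** Placeholder for the cleanup hook a real patch would expose. */
  public interface WalReferenceCleaner {
    void cleanUnreferencedWals();
  }

  /** Run the cleanup on a fixed schedule, so idle sources also drop stale WAL references. */
  public static ScheduledExecutorService start(WalReferenceCleaner cleaner) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(cleaner::cleanUnreferencedWals, 5, 5, TimeUnit.MINUTES);
    return scheduler;
  }
}
{code}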



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2020-07-27 Thread leizhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166100#comment-17166100
 ] 

leizhang commented on HBASE-22620:
--

I think this problem still exists in HBase 2.x; using HBase 2.2.5, I hit the same 
problem.

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.1.0, 1.4.8, 1.4.9, 2.2.5
>Reporter: leizhang
>Assignee: yaojingyi
>Priority: Major
> Attachments: HBASE-22620.branch-1.4.001.patch
>
>
> When I enabled replication on my HBase cluster (20 regionserver nodes) and added 
> a peer cluster, I created, for example, a table with 3 regions with 
> REPLICATION_SCOPE set to 1, opened on 3 of the 20 regionservers. Because there is 
> no data (entryBatch) to replicate, the remaining 17 nodes accumulate lots of WAL 
> references under the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" 
> that are never cleaned up, so the corresponding WAL files on HDFS are not cleaned 
> up either. When I checked my test cluster after about four months, it had 
> accumulated about 50,000 WAL files in the oldWALs directory on HDFS. The source 
> code shows that only when there is data to replicate, and after some data has 
> been replicated by the source endpoint, does it run the useless-WAL check and 
> clean the references on zk, after which the useless HDFS WAL files are cleaned up 
> normally. So I think: do we need another way to trigger the useless-WAL cleaning 
> job in a replication cluster? Maybe in the replication progress report scheduled 
> task (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2020-07-27 Thread leizhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Affects Version/s: (was: 1.2.4)
   2.2.5

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.1.0, 1.4.8, 1.4.9, 2.2.5
>Reporter: leizhang
>Assignee: yaojingyi
>Priority: Major
> Attachments: HBASE-22620.branch-1.4.001.patch
>
>
> When I enabled replication on my HBase cluster (20 regionserver nodes) and added 
> a peer cluster, I created, for example, a table with 3 regions with 
> REPLICATION_SCOPE set to 1, opened on 3 of the 20 regionservers. Because there is 
> no data (entryBatch) to replicate, the remaining 17 nodes accumulate lots of WAL 
> references under the zk node "/hbase/replication/rs/\{resionserver}/\{peerId}/" 
> that are never cleaned up, so the corresponding WAL files on HDFS are not cleaned 
> up either. When I checked my test cluster after about four months, it had 
> accumulated about 50,000 WAL files in the oldWALs directory on HDFS. The source 
> code shows that only when there is data to replicate, and after some data has 
> been replicated by the source endpoint, does it run the useless-WAL check and 
> clean the references on zk, after which the useless HDFS WAL files are cleaned up 
> normally. So I think: do we need another way to trigger the useless-WAL cleaning 
> job in a replication cluster? Maybe in the replication progress report scheduled 
> task (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-07-01 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876588#comment-16876588
 ] 

leizhang commented on HBASE-22620:
--

{code:java}
// Workaround: when the poll returns no batch, still update the log position and
// clean up the old WAL references on ZooKeeper before continuing the loop.
WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
if (entryBatch == null) {
  manager.cleanOldLogs(this.getCurrentPath().getName(), peerClusterZnode,
      this.replicationQueueInfo.isQueueRecovered());
  continue;
}
shipEdits(entryBatch);
{code}
Thank you very much! At present I clean up the old log zk refs by calling 
cleanOldLogs when the entryBatch is empty, and it does work, but I don't know 
whether it is appropriate to put the logic there. Looking forward to your patch.

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 2.1.0, 1.4.8, 1.4.9
>Reporter: leizhang
>Assignee: yaojingyi
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-07-01 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Affects Version/s: 2.1.0
   1.4.8

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 2.1.0, 1.4.8, 1.4.9
>Reporter: leizhang
>Assignee: yaojingyi
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-27 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Fix Version/s: (was: 2.1.0)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-27 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874710#comment-16874710
 ] 

leizhang commented on HBASE-22620:
--

It is not only the pressure on zk. Consider the case where a large amount of 
data is written by non-replication tables: the hlog is not empty, yet hlogs 
still accumulate on hdfs. This is what caused the hlogs under the hdfs 
directory /oldWALs to reach about 30TB.

-HBASE-20206- may not be helpful for this issue. I reviewed the code of 
HBase 2.1.0 and the logic is:

 
{code:java}
WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
if (entryBatch == null) {
  // since there is no logs need to replicate, we refresh the ageOfLastShippedOp
  
source.getSourceMetrics().setAgeOfLastShippedOp(EnvironmentEdgeManager.currentTime(),
walGroupId);
  continue;
}
// the NO_MORE_DATA instance has no path so do not call shipEdits
if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
  noMoreData();
} else {
  shipEdits(entryBatch);
}
{code}
entryReader.take() in HBase 1.4.9 has been replaced by 
entryReader.poll(getEntriesTimeout); indeed, the thread no longer blocks. But 
if entryBatch is null, it only updates the age metric and then the loop 
continues, so shipEdits() is still not called. Could you show me where, when 
the entry batch is null, the logic that handles the old hlogs is? Thank you.

 

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 1.4.9
>Reporter: leizhang
>Priority: Major
> Fix For: 2.1.0
>
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-27 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang reopened HBASE-22620:
--

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 1.4.9
>Reporter: leizhang
>Priority: Major
> Fix For: 2.1.0
>
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-27 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Comment: was deleted

(was: thank you very much !  I check the  source code of Hbase2.1.0 ,and find 
the 

entryReader.take()   has been replaced by entryReader.poll(getEntriesTimeout);

then the tread will not be blocked and will excute the following logic, and the 
problem can be solved !)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 1.4.9
>Reporter: leizhang
>Priority: Major
> Fix For: 2.1.0
>
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-27 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang resolved HBASE-22620.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 1.4.9
>Reporter: leizhang
>Priority: Major
> Fix For: 2.1.0
>
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-27 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874063#comment-16874063
 ] 

leizhang commented on HBASE-22620:
--

Thank you very much! I checked the source code of HBase 2.1.0 and found that 
entryReader.take() has been replaced by entryReader.poll(getEntriesTimeout); 
the thread will therefore no longer be blocked and will execute the following 
logic, so the problem can be solved.
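
To make the difference concrete, here is a standalone JDK-only sketch (not 
HBase code): poll(timeout) returns null when the queue stays empty, so the 
caller regains control and can run housekeeping, while take() parks the thread 
indefinitely, which is exactly the state shown in the stack traces further 
down this thread.
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollVsTakeDemo {
  public static void main(String[] args) throws InterruptedException {
    LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // poll(timeout) gives up after the timeout and returns null, so an idle
    // replication source could still run cleanup work in its loop.
    String batch = queue.poll(200, TimeUnit.MILLISECONDS);
    System.out.println("poll() returned: " + batch); // prints "poll() returned: null"

    // take() would park this thread forever, since nothing is ever offered:
    // String blocked = queue.take();
  }
}
{code}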

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Affects Version/s: (was: 2.0.3)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Affects Version/s: 1.2.4

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 1.2.4, 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873771#comment-16873771
 ] 

leizhang commented on HBASE-22620:
--

Yesterday I found that HBase 1.2.4 also has the same problem.

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873771#comment-16873771
 ] 

leizhang edited comment on HBASE-22620 at 6/27/19 2:41 AM:
---

Yesterday, when I checked our prod cluster, I found that HBase 1.2.4 also has 
the same problem.


was (Author: zl_cn_hbase):
Yesterday, I found the Hbase 1.2.4 also exists the same problem

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873770#comment-16873770
 ] 

leizhang commented on HBASE-22620:
--

Sorry, I only checked the code in HBase 2.x and found it is the same as in 
1.4.9; actually I haven't done any testing on HBase 2.x.

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Description: When I open the replication feature on my hbase cluster (20 
regionserver nodes) and added a peer cluster, for example, I create a table 
with 3 regions with REPLICATION_SCOPE set to 1, which opened on 3 regionservers 
of 20. Due to no data(entryBatch) to replicate ,the left 17 nodes  accumulate 
lots of wal references on the zk node 
"/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned up, 
which cause lots of wal file on hdfs will not be cleaned up either. When I 
check my test cluster after about four months, it accumulates about 5w wal 
files in the oldWal directory on hdfs. The source code shows that only there 
are data to be replicated, and after some data is replicated in the source 
endpoint, then it will executed the useless wal file check, and clean their 
references on zk, and the hdfs useless wal files will be cleaned up normally. 
So I think do we need other method to trigger the useless wal cleaning job in a 
replication cluster? May be  in the  replication progress report  schedule task 
 (just like ReplicationStatisticsTask.class)  (was: When I open the replication 
feature on my hbase cluster (20 regionserver nodes), for example, I create a 
table with 3 regions, which opened on 3 regionservers of 20. Due to no data to 
replicate ,the left 17 nodes  accumulate lots of wal references on the zk node 
"/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned up, 
which cause lots of wal file on hdfs will not be cleaned up either. When I 
check my test cluster after about four months, it accumulates about 5w wal 
files in the oldWal directory on hdfs. The source code shows that only there 
are data to be replicated, and after some data is replicated in the source 
endpoint, then it will executed the useless wal file check, and clean their 
references on zk, and the hdfs useless wal files will be cleaned up normally. 
So I think do we need other method to trigger the useless wal cleaning job in a 
replication cluster? May be  in the  replication progress report  schedule task 
 (just like ReplicationStatisticsTask.class))

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes) and added a peer cluster, for example, I create a table with 3 regions 
> with REPLICATION_SCOPE set to 1, which opened on 3 regionservers of 20. Due 
> to no data(entryBatch) to replicate ,the left 17 nodes  accumulate lots of 
> wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Issue Type: Bug  (was: Improvement)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17 nodes  
> accumulate lots of wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873305#comment-16873305
 ] 

leizhang commented on HBASE-22620:
--

 

Did you expect one just like this? HBase version 1.4.9:
{code:java}
main-EventThread.replicationSource,1.replicationSource.xx.hbase.lq2%2C16020%2C1561379323483,1"
 #153306 daemon prio=5 os_prio=0 tid=0x7f0844681800 nid=0xe49 waiting on 
condition [0x7ef573a1d000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x7f05ba84c3d8> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.take(ReplicationSourceWALReaderThread.java:227)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:550)

"main.replicationSource,1-EventThread" #153305 daemon prio=5 os_prio=0 
tid=0x7f0844765800 nid=0xe48 waiting on condition [0x7ef57391c000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x7f05ba847f50> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
{code}
Once the peer was added, if there is no entry to replicate, the hlog refs 
accumulate on zk and the hlogs accumulate in /oldWALs.

I also found that, due to the huge data amount (about 30T of hlog files under 
/hbase/oldWALs), when I execute the command "remove_peer 'peer1'" on my 
cluster, the master shows the log below and all regionservers abort:
{code:java}
// master log excerpt
ERROR [B.defaultRpcServer.handler=172,queue=22,port=16000] 
master.MasterRpcServices: Region server ,16020,1503477315622 
reported a fatal error:
ABORTING region server xxx,16020,1503477315622: Failed to delete 
queue (queueId=peer1)
Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:672)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1671)
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1413)
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1280)
at 
org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeQueue(ReplicationQueuesZKImpl.java:93)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.deleteSource(ReplicationSourceManager.java:298)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:579)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.peerRemoved(ReplicationSourceManager.java:590)
at 
org.apache.hadoop.hbase.replication.ReplicationTrackerZKImpl$PeersWatcher.nodeDeleted(ReplicationTrackerZKImpl.java:171)
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:628)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
{code}
So I had to remove the hlog refs on zk manually and let the regionservers 
clean up the hlogs normally.
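
For anyone who hits the same abort, this is roughly how I remove a stale queue 
by hand, deleting the WAL reference znodes one by one instead of relying on 
the single recursive multi() that failed with ConnectionLoss. A sketch against 
the plain ZooKeeper client; the quorum address, regionserver name and paths 
are placeholders for your own cluster:
{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class RemoveStaleQueue {
  public static void main(String[] args) throws Exception {
    // Placeholder quorum and paths; adjust them to the actual cluster layout.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });
    String queuePath = "/hbase/replication/rs/rs-host,16020,1503477315622/peer1";

    // Delete the WAL reference children one at a time instead of one huge
    // multi(), so a single ZooKeeper request never grows past the server limits.
    List<String> walRefs = zk.getChildren(queuePath, false);
    for (String wal : walRefs) {
      zk.delete(queuePath + "/" + wal, -1);
    }
    zk.delete(queuePath, -1);
    zk.close();
  }
}
{code}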

 

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17

[jira] [Comment Edited] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-26 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870944#comment-16870944
 ] 

leizhang edited comment on HBASE-22620 at 6/26/19 8:12 AM:
---

I reviewed the code that creates the entryBatch; the call chain is as follows:
{code:java}
ReplicationSourceWALReaderThread.class -> run() -> entryStream.hasNext() -> 
tryAdvanceEntry() ->checkReader()->openNextLog()->readNextEntryAndSetPosition()
{code}
When reaching the end of a wal file it switches to the next hlog, so the 
current log position update logic is correct, and all the hlogs are reachable 
by the replication source endpoint and switched correctly whether or not we 
have entries to replicate; but the zk refs cleanup job may be blocked by the 
blocking queue's take() method.

Consider a cluster with replication configured (hbase.replication set to true 
and a peer added) where no table enables the replication property: all the 
logs of the cluster will be kept.


was (Author: zl_cn_hbase):
  I view the code that create the entryBatch,just as follows :
{code:java}
ReplicationSourceWALReaderThread.class -> run() -> entryStream.hasNext() -> 
tryAdvanceEntry() ->checkReader()->openNextLog()->readNextEntryAndSetPosition()
{code}
  when reaching the end of a wal file ,it will switch to the next hlog, so the 
current log position update logic is correct,and all the hlog are available by 
replication source endpoint and switch correctly no matter whether we have 
entry to replicate,  but the zk refs cleaning up job may be blocked due to the  
blocking queue's take()  method. 

  Imagine a cluster configured replication(set hbase.replication to true) but 
all tables don't open the replication property, all the log of the cluster will 
be keep.  

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17 nodes  
> accumulate lots of wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871929#comment-16871929
 ] 

leizhang commented on HBASE-22620:
--

I printed the stack info of my regionserver and found the thread blocking here:
{code:java}
"regionserver/hostxxx/ipxx:16020.replicationSource,1-EventThread" 
#421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on 
condition [0x7ef574423000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x7eff45024a28> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
{code}

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17 nodes  
> accumulate lots of wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871918#comment-16871918
 ] 

leizhang commented on HBASE-22620:
--

"regionserver/x:16020.replicationSource,1-EventThread" #421 daemon 
prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on condition 
[0x7ef574423000]
  java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for <0x7eff45024a28> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)//blocking
 here  at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17 nodes  
> accumulate lots of wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Comment: was deleted

(was: "regionserver/x:16020.replicationSource,1-EventThread" #421 
daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on condition 
[0x7ef574423000]
  java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for <0x7eff45024a28> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)//blocking
 here  at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501))

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17 nodes  
> accumulate lots of wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871917#comment-16871917
 ] 

leizhang commented on HBASE-22620:
--

I printed the stack info of my regionserver and found the thread blocking here:
{code:java}
"regionserver/hbase-zeus-26-242-225.hadoop.lq2/10.26.242.225:16020.replicationSource,1-EventThread"
 #421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on 
condition [0x7ef574423000]
  java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for <0x7eff45024a28> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
//blocking here
  at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
{code}

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17 nodes  
> accumulate lots of wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Comment: was deleted

(was: I print the stack info of my regionserver,and find the thread blocking 
here:
{code:java}
"regionserver/hbase-zeus-26-242-225.hadoop.lq2/10.26.242.225:16020.replicationSource,1-EventThread"
 #421 daemon prio=5 os_prio=0 tid=0x7f084412d800 nid=0x6b01c waiting on 
condition [0x7ef574423000]
  java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for <0x7eff45024a28> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
//blocking here
  at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
{code})

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I open the replication feature on my hbase cluster (20 regionserver 
> nodes), for example, I create a table with 3 regions, which opened on 3 
> regionservers of 20. Due to no data to replicate ,the left 17 nodes  
> accumulate lots of wal references on the zk node 
> "/hbase/replication/rs/\{resionserver}/\{peerId}/"  and will not be cleaned 
> up, which cause lots of wal file on hdfs will not be cleaned up either. When 
> I check my test cluster after about four months, it accumulates about 5w wal 
> files in the oldWal directory on hdfs. The source code shows that only there 
> are data to be replicated, and after some data is replicated in the source 
> endpoint, then it will executed the useless wal file check, and clean their 
> references on zk, and the hdfs useless wal files will be cleaned up normally. 
> So I think do we need other method to trigger the useless wal cleaning job in 
> a replication cluster? May be  in the  replication progress report  schedule 
> task  (just like ReplicationStatisticsTask.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870983#comment-16870983
 ] 

leizhang commented on HBASE-22620:
--

Sorry, I mean logPositionAndCleanOldLogs().

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Comment: was deleted

(was: Thank you. You can just keep a replication cluster with no data written
for some time and then observe the zk log refs; the problem should reproduce
like it did for me.)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870944#comment-16870944
 ] 

leizhang commented on HBASE-22620:
--

  I looked at the code that creates the entryBatch, as follows:
{code:java}
ReplicationSourceWALReaderThread.class -> run() -> entryStream.hasNext() ->
tryAdvanceEntry() -> checkReader() -> openNextLog() -> readNextEntryAndSetPosition()
{code}
  When the reader reaches the end of a WAL file it switches to the next one, so
the current log position update logic is correct, and every WAL is visible to
the replication source endpoint and is switched correctly whether or not there
are entries to replicate. But the zk refs cleanup job can still be blocked by
the blocking queue's take() method.

  Imagine a cluster with replication configured (hbase.replication set to true)
but no table has the replication property enabled: every WAL of the cluster will
be kept.
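
A rough sketch of the scheduled cleanup suggested in the description (trigger
the old-WAL ref cleanup from a periodic task, similar in spirit to
ReplicationStatisticsTask) could look like the code below. This is only an
illustration of the scheduling pattern, not HBase code: ReplicationQueueView,
getCurrentWal() and cleanOldWals() are hypothetical placeholders for whatever
the source manager would actually expose.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IdleWalCleanupChore implements Runnable {

  // Hypothetical view of a per-peer replication queue; a stand-in for whatever
  // ReplicationSourceManager would expose. Not an HBase interface.
  interface ReplicationQueueView {
    String getCurrentWal();            // WAL the reader is currently positioned on
    void cleanOldWals(String current); // drop zk refs for every WAL older than 'current'
  }

  private final ReplicationQueueView queue;

  public IdleWalCleanupChore(ReplicationQueueView queue) {
    this.queue = queue;
  }

  @Override
  public void run() {
    // Even if nothing was shipped, the reader may already have advanced past
    // finished WALs, so their zk refs (and the HDFS files) can be released.
    queue.cleanOldWals(queue.getCurrentWal());
  }

  public static void schedule(ReplicationQueueView queue) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    pool.scheduleAtFixedRate(new IdleWalCleanupChore(queue), 5, 5, TimeUnit.MINUTES);
  }
}
{code}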

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-24 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870875#comment-16870875
 ] 

leizhang commented on HBASE-22620:
--

Thank you. You can just keep a replication cluster with no data written for some
time and then observe the zk log refs; the problem should reproduce like it did
for me.

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-23 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870854#comment-16870854
 ] 

leizhang commented on HBASE-22620:
--

{code:java}
WALEntryBatch entryBatch = entryReader.take();
shipEdits(entryBatch);
{code}
I see that the entryReader.take() method takes a batch from a
LinkedBlockingQueue and blocks until the queue has an entry to replicate; when
the blocking queue is empty it blocks forever, so the shipEdits() method is
never executed.
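
A minimal standalone illustration of that blocking behaviour (plain
java.util.concurrent, not HBase code): with take() the loop never reaches an
idle branch on an empty queue, while a poll() with a timeout wakes up
periodically and would give the shipper a chance to run housekeeping such as
cleaning old WAL refs even with no traffic.
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class IdleQueueDemo {
  public static void main(String[] args) throws InterruptedException {
    LinkedBlockingQueue<String> batches = new LinkedBlockingQueue<>();

    // take() would block here forever because the queue stays empty,
    // which is exactly the situation described for entryReader.take().
    // String batch = batches.take();

    // poll() with a timeout returns null on an idle queue, so the loop
    // still gets a chance to do housekeeping between batches.
    for (int i = 0; i < 3; i++) {
      String batch = batches.poll(2, TimeUnit.SECONDS);
      if (batch == null) {
        System.out.println("queue idle - old WAL refs could be cleaned here");
      } else {
        System.out.println("shipping " + batch);
      }
    }
  }
}
{code}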

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-23 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870837#comment-16870837
 ] 

leizhang commented on HBASE-22620:
--

No, we don't modify the logic in the replication source endpoint.

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-23 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870829#comment-16870829
 ] 

leizhang edited comment on HBASE-22620 at 6/24/19 6:12 AM:
---

Thank you for the reply. By "no data to replicate" I mean that no WAL entry
needs to be replicated from the log queue, because the logic that cleans the old
log refs is reached from the shipEdits() method in
ReplicationSourceShipperThread.class. Part of the code around shipEdits() is as
follows:
{code:java}
WALEntryBatch entryBatch = entryReader.take();
// ship the entryBatch to the target cluster
shipEdits(entryBatch);
{code}
Inside shipEdits() the call chain is:
{code:java}
shipEdits() -> updateLogPosition()
-> ReplicationSourceManager.logPositionAndCleanOldLogs(){code}
So logPositionAndCleanOldLogs() is only called, and the old log refs are only
removed from zk (under the znode
/hbase/replication/rs/\{regionserver}/\{peerId}/), when the entryReader has an
entryBatch to replicate. When there is no entry to replicate (for example,
regionserver A hosts no regions of a table with the replication property
enabled, even though the cluster has replication enabled),
logPositionAndCleanOldLogs() is never triggered on A, the zk refs remain there
forever, and the real log files on HDFS are never cleaned either. After a long
time, with the log roll mechanism, lots of log files accumulate and cannot be
removed because of the refs on zk.

Consider two situations:

1. there is no data in a WAL file

2. there are entries in a WAL file, but they will never be replicated (the table
does not have the replication property enabled, so the entries are skipped)

Just as you say, the entire WAL file will still be read and the current
replicating log file position can be updated normally, but the old log refs
cleanup logic will never be triggered because there is no entry that needs to be
replicated. What I actually see on my test cluster confirms that.


was (Author: zl_cn_hbase):
Thank you for the reply. By "no data to replicate" I mean that no WAL entry
needs to be replicated from the log queue, because the logic that cleans the old
log refs is reached from the shipEdits() method in
ReplicationSourceShipperThread.class. Part of the code around shipEdits() is as
follows:
{code:java}
WALEntryBatch entryBatch = entryReader.take();
// ship the entryBatch to the target cluster
shipEdits(entryBatch);
{code}
Inside shipEdits() the call chain is:
{code:java}
shipEdits() -> updateLogPosition()
-> ReplicationSourceManager.logPositionAndCleanOldLogs(){code}
So logPositionAndCleanOldLogs() is only called, and the old log refs are only
removed from zk (under the znode
/hbase/replication/rs/\{regionserver}/\{peerId}/), when the entryReader has an
entryBatch to replicate. When there is no entry to replicate (for example,
regionserver A hosts no regions of a table with the replication property
enabled), logPositionAndCleanOldLogs() is never triggered on A, the zk refs
remain there forever, and the real log files on HDFS are never cleaned either.
After a long time, with the log roll mechanism, lots of log files accumulate and
cannot be removed because of the refs on zk.

Consider two situations:

1. there is no data in a WAL file

2. there are entries in a WAL file, but they will never be replicated (the table
does not have the replication property enabled, so the entries are skipped)

Just as you say, the entire WAL file will still be read and the current
replicating log file position can be updated normally, but the old log refs
cleanup logic will never be triggered because there is no entry that needs to be
replicated. What I actually see on my test cluster confirms that.
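
To make that dependency explicit, here is a simplified sketch of the control
flow described above, using placeholder types rather than the real
ReplicationSourceShipperThread code: because the position update and the ref
cleanup are chained behind take() and shipEdits(), a queue that never yields a
batch never reaches logPositionAndCleanOldLogs().
{code:java}
public class ShipperLoopSketch {
  // Tiny stand-ins so the sketch is self-contained; not HBase interfaces.
  interface EntryReader { WalEntryBatch take() throws InterruptedException; }
  interface SourceManager { void logPositionAndCleanOldLogs(String wal, long position); }
  static final class WalEntryBatch {
    final String walName;
    final long endPosition;
    WalEntryBatch(String walName, long endPosition) {
      this.walName = walName;
      this.endPosition = endPosition;
    }
  }

  private final EntryReader entryReader;
  private final SourceManager manager;

  ShipperLoopSketch(EntryReader entryReader, SourceManager manager) {
    this.entryReader = entryReader;
    this.manager = manager;
  }

  void runLoop() throws InterruptedException {
    while (true) {
      WalEntryBatch batch = entryReader.take(); // blocks while there is nothing to replicate
      shipEdits(batch);                         // never reached on an idle queue ...
      manager.logPositionAndCleanOldLogs(       // ... so neither is the zk ref cleanup
          batch.walName, batch.endPosition);
    }
  }

  private void shipEdits(WalEntryBatch batch) {
    // ship the batch to the peer cluster (omitted)
  }
}
{code}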

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in t

[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to be replicated

2019-06-23 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Summary: When a cluster open replication,regionserver will not clean up the 
walLog references on zk due to no wal entry need to be replicated  (was: When a 
cluster open replication,regionserver will not clean up the walLog references 
on zk due to no wal entry need to replicate)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to be replicated
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no wal entry need to replicate

2019-06-23 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Summary: When a cluster open replication,regionserver will not clean up the 
walLog references on zk due to no wal entry need to replicate  (was: When a 
cluster open replication,regionserver will not clean up the walLog references 
on zk due to no data to replicate)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no wal entry need to replicate
> 
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate

2019-06-23 Thread leizhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870829#comment-16870829
 ] 

leizhang commented on HBASE-22620:
--

Thank you for the reply. By "no data to replicate" I mean that no WAL entry
needs to be replicated from the log queue, because the logic that cleans the old
log refs is reached from the shipEdits() method in
ReplicationSourceShipperThread.class. Part of the code around shipEdits() is as
follows:
{code:java}
WALEntryBatch entryBatch = entryReader.take();
// ship the entryBatch to the target cluster
shipEdits(entryBatch);
{code}
Inside shipEdits() the call chain is:
{code:java}
shipEdits() -> updateLogPosition()
-> ReplicationSourceManager.logPositionAndCleanOldLogs(){code}
So logPositionAndCleanOldLogs() is only called, and the old log refs are only
removed from zk (under the znode
/hbase/replication/rs/\{regionserver}/\{peerId}/), when the entryReader has an
entryBatch to replicate. When there is no entry to replicate (for example,
regionserver A hosts no regions of a table with the replication property
enabled), logPositionAndCleanOldLogs() is never triggered on A, the zk refs
remain there forever, and the real log files on HDFS are never cleaned either.
After a long time, with the log roll mechanism, lots of log files accumulate and
cannot be removed because of the refs on zk.

Consider two situations:

1. there is no data in a WAL file

2. there are entries in a WAL file, but they will never be replicated (the table
does not have the replication property enabled, so the entries are skipped)

Just as you say, the entire WAL file will still be read and the current
replicating log file position can be updated normally, but the old log refs
cleanup logic will never be triggered because there is no entry that needs to be
replicated. What I actually see on my test cluster confirms that.

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no data to replicate
> --
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate

2019-06-23 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Description: When I enable the replication feature on my HBase cluster (20
regionserver nodes), for example, I create a table with 3 regions, which are
opened on 3 of the 20 regionservers. Because there is no data to replicate, the
remaining 17 nodes accumulate lots of WAL references under the zk node
"/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
which means many WAL files on HDFS are never cleaned up either. When I checked
my test cluster after about four months, it had accumulated about 50,000 WAL
files in the oldWALs directory on HDFS. The source code shows that only when
there is data to replicate, and only after some data has been replicated by the
source endpoint, is the useless WAL file check executed to clean their
references on zk so the useless WAL files on HDFS can be cleaned up normally.
So I think we need another way to trigger the useless WAL cleanup job in a
replication cluster, maybe in the replication progress report scheduled task
(just like ReplicationStatisticsTask.class).  (was: When I enable the
replication feature on my HBase cluster (20 regionserver nodes), for example, I
create a table with 3 regions, which are opened on 3 of the 20 regionservers.
Because there is no data to replicate, the remaining 17 nodes accumulate lots of
WAL references under the zk node
"/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
which means many WAL files on HDFS are never cleaned up either. When I checked
my test cluster after about four months, it had accumulated about 50,000 WAL
files in the oldWALs directory on HDFS. The source code shows that only when
there is data to replicate, and only after some data has been replicated by the
source endpoint, is the useless WAL file check executed to clean their
references on zk so the useless WAL files on HDFS can be cleaned up normally.
So I think we need another way to trigger the useless WAL cleanup job in a
replication cluster, maybe in the replication progress report scheduled task
(ReplicationStatisticsTask.class).)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no data to replicate
> --
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (just like ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate

2019-06-23 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Description: When I enable the replication feature on my HBase cluster (20
regionserver nodes), for example, I create a table with 3 regions, which are
opened on 3 of the 20 regionservers. Because there is no data to replicate, the
remaining 17 nodes accumulate lots of WAL references under the zk node
"/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
which means many WAL files on HDFS are never cleaned up either. When I checked
my test cluster after about four months, it had accumulated about 50,000 WAL
files in the oldWALs directory on HDFS. The source code shows that only when
there is data to replicate, and only after some data has been replicated by the
source endpoint, is the useless WAL file check executed to clean their
references on zk so the useless WAL files on HDFS can be cleaned up normally.
So I think we need another way to trigger the useless WAL cleanup job in a
replication cluster, maybe in the replication progress report scheduled task
(ReplicationStatisticsTask.class).  (was: When I enable the replication feature
on my HBase cluster (20 regionserver nodes), for example, I create a table with
3 regions, which are opened on 3 of the 20 regionservers. Because there is no
data to replicate in the source cluster, the remaining 17 nodes accumulate lots
of WAL references under the zk node
"/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
which means many WAL files on HDFS are never cleaned up either. When I checked
my test cluster after about four months, it had accumulated about 50,000 WAL
files in the oldWALs directory on HDFS. The source code shows that only when
there is data to replicate, and only after some data has been replicated by the
source endpoint, is the useless WAL file check executed to clean their
references on zk so the useless WAL files on HDFS can be cleaned up normally.
So I think we need another way to trigger the useless WAL cleanup job in a
replication cluster, maybe in the replication progress report scheduled task
(ReplicationStatisticsTask.class).)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no data to replicate
> --
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate, the remaining 17
> nodes accumulate lots of WAL references under the zk node
> "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned up,
> which means many WAL files on HDFS are never cleaned up either. When I checked
> my test cluster after about four months, it had accumulated about 50,000 WAL
> files in the oldWALs directory on HDFS. The source code shows that only when
> there is data to replicate, and only after some data has been replicated by the
> source endpoint, is the useless WAL file check executed to clean their
> references on zk so the useless WAL files on HDFS can be cleaned up normally.
> So I think we need another way to trigger the useless WAL cleanup job in a
> replication cluster, maybe in the replication progress report scheduled task
> (ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate

2019-06-23 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Issue Type: Improvement  (was: Bug)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no data to replicate
> --
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate in the source
> cluster, the remaining 17 nodes accumulate lots of WAL references under the zk
> node "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned
> up, which means many WAL files on HDFS are never cleaned up either. When I
> checked my test cluster after about four months, it had accumulated about
> 50,000 WAL files in the oldWALs directory on HDFS. The source code shows that
> only when there is data to replicate, and only after some data has been
> replicated by the source endpoint, is the useless WAL file check executed to
> clean their references on zk so the useless WAL files on HDFS can be cleaned up
> normally. So I think we need another way to trigger the useless WAL cleanup job
> in a replication cluster, maybe in the replication progress report scheduled
> task (ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-22619) Hbase

2019-06-23 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang resolved HBASE-22619.
--
Resolution: Not A Bug

> Hbase
> -
>
> Key: HBASE-22619
> URL: https://issues.apache.org/jira/browse/HBASE-22619
> Project: HBase
>  Issue Type: Bug
>Reporter: leizhang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22620) When a cluster open replication,regionserver will not clean up the walLog references on zk due to no data to replicate

2019-06-23 Thread leizhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leizhang updated HBASE-22620:
-
Summary: When a cluster open replication,regionserver will not clean up the 
walLog references on zk due to no data to replicate  (was: When a cluster open 
replication,regionserver will not clean the walLog reference on zk due to no 
data to replicate)

> When a cluster open replication,regionserver will not clean up the walLog 
> references on zk due to no data to replicate
> --
>
> Key: HBASE-22620
> URL: https://issues.apache.org/jira/browse/HBASE-22620
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.3, 1.4.9
>Reporter: leizhang
>Priority: Major
>
> When I enable the replication feature on my HBase cluster (20 regionserver
> nodes), for example, I create a table with 3 regions, which are opened on 3 of
> the 20 regionservers. Because there is no data to replicate in the source
> cluster, the remaining 17 nodes accumulate lots of WAL references under the zk
> node "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned
> up, which means many WAL files on HDFS are never cleaned up either. When I
> checked my test cluster after about four months, it had accumulated about
> 50,000 WAL files in the oldWALs directory on HDFS. The source code shows that
> only when there is data to replicate, and only after some data has been
> replicated by the source endpoint, is the useless WAL file check executed to
> clean their references on zk so the useless WAL files on HDFS can be cleaned up
> normally. So I think we need another way to trigger the useless WAL cleanup job
> in a replication cluster, maybe in the replication progress report scheduled
> task (ReplicationStatisticsTask.class).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-22620) When a cluster open replication,regionserver will not clean the walLog reference on zk due to no data to replicate

2019-06-23 Thread leizhang (JIRA)
leizhang created HBASE-22620:


 Summary: When a cluster open replication,regionserver will not 
clean the walLog reference on zk due to no data to replicate
 Key: HBASE-22620
 URL: https://issues.apache.org/jira/browse/HBASE-22620
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 1.4.9, 2.0.3
Reporter: leizhang


When I enable the replication feature on my HBase cluster (20 regionserver
nodes), for example, I create a table with 3 regions, which are opened on 3 of
the 20 regionservers. Because there is no data to replicate in the source
cluster, the remaining 17 nodes accumulate lots of WAL references under the zk
node "/hbase/replication/rs/\{regionserver}/\{peerId}/" that are never cleaned
up, which means many WAL files on HDFS are never cleaned up either. When I
checked my test cluster after about four months, it had accumulated about
50,000 WAL files in the oldWALs directory on HDFS. The source code shows that
only when there is data to replicate, and only after some data has been
replicated by the source endpoint, is the useless WAL file check executed to
clean their references on zk so the useless WAL files on HDFS can be cleaned up
normally. So I think we need another way to trigger the useless WAL cleanup job
in a replication cluster, maybe in the replication progress report scheduled
task (ReplicationStatisticsTask.class).
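
For anyone who wants to check their own cluster, the accumulation can be
confirmed by counting the children of the replication queue znodes. The sketch
below uses the plain ZooKeeper client; the quorum address and the /hbase root
are assumptions (take them from hbase.zookeeper.quorum and
zookeeper.znode.parent on your cluster), and the layout
/hbase/replication/rs/<regionserver>/<peerId>/<wal> is the default one described
above.
{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ReplicationWalRefCount {
  public static void main(String[] args) throws Exception {
    // Connection string and znode root are placeholders; adjust for your cluster.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });
    String rsRoot = "/hbase/replication/rs";
    for (String rs : zk.getChildren(rsRoot, false)) {
      String rsPath = rsRoot + "/" + rs;
      for (String peer : zk.getChildren(rsPath, false)) {
        // Each child of the peer znode is a queued WAL reference.
        List<String> walRefs = zk.getChildren(rsPath + "/" + peer, false);
        System.out.println(rs + " peer=" + peer + " queued WAL refs=" + walRefs.size());
      }
    }
    zk.close();
  }
}
{code}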



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-22619) Hbase

2019-06-23 Thread leizhang (JIRA)
leizhang created HBASE-22619:


 Summary: Hbase
 Key: HBASE-22619
 URL: https://issues.apache.org/jira/browse/HBASE-22619
 Project: HBase
  Issue Type: Bug
Reporter: leizhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)