[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-11-12 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230989#comment-17230989
 ] 

Hudson commented on HBASE-24779:


Results for branch branch-2.2
[build #116 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/116/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/116//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/116//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/116//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Improve insight into replication WAL readers hung on checkQuota
> ---
>
> Key: HBASE-24779
> URL: https://issues.apache.org/jira/browse/HBASE-24779
> Project: HBase
>  Issue Type: Task
>  Components: Replication
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.4.0
>
>
> Helped a customer this past weekend who, with a large number of 
> RegionServers, had some RegionServers which replicated data to a peer 
> without issue while other RegionServers did not.
> The number of queued logs varied over the past 24hrs in the same manner: 
> some spikes into the 100's of queued logs, but at other times only 1's-10's 
> of logs were queued.
> We were able to validate that there were "good" and "bad" RegionServers by 
> creating a test table, assigning it to a RegionServer, enabling replication 
> on that table, and validating whether the local puts were replicated to a 
> peer. On a good RS, data was replicated immediately. On a bad RS, data was 
> never replicated (at least, not within the 10's of minutes that we waited).
> On the "bad" RS, we were able to observe that the {{wal-reader}} thread(s) 
> on that RS were spending time in a Thread.sleep() in a different location 
> than on the other RegionServers. Specifically, they were sitting in the 
> sleep call in {{ReplicationSourceWALReader#checkQuota()}}, _not_ in the 
> {{handleEmptyWALBatch()}} method of the same class.
> My only assumption is that, somehow, these RegionServers got into a 
> situation where they "allocated" memory from the quota but never freed it. 
> Then, because the WAL reader thinks it has no free memory, it blocks 
> indefinitely, and there are no pending edits to ship and (thus) free that 
> memory. A cursory glance at the code gives me a _lot_ of anxiety around 
> places where we don't properly clean it up (e.g. batches that fail to ship, 
> dropping a peer). As a first stab, let me add some more debugging so we can 
> actually track this state properly for the operators and their sanity.
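To make the suspected failure mode concrete, below is a minimal, illustrative sketch of the acquire/sleep/release pattern described above. It is not the actual HBase implementation: the {{BufferQuota}} class and every name in it are hypothetical, and the only point is that an acquire without a matching release leaves a reader spinning forever in its checkQuota-style sleep.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: a shared edits-buffer quota of the kind the WAL readers consult. */
public class BufferQuota {
  private final long limitBytes;
  private final AtomicLong usedBytes = new AtomicLong();

  public BufferQuota(long limitBytes) {
    this.limitBytes = limitBytes;
  }

  /** Reserves {@code size} bytes if the quota has room; returns false otherwise. */
  public boolean tryAcquire(long size) {
    while (true) {
      long current = usedBytes.get();
      if (current + size > limitBytes) {
        return false;
      }
      if (usedBytes.compareAndSet(current, current + size)) {
        return true;
      }
    }
  }

  /** Must be called once a shipped batch is done with its bytes; a missed call leaks quota. */
  public void release(long size) {
    usedBytes.addAndGet(-size);
  }

  /** checkQuota-style loop: sleep until space frees up. If release() is never called, this never returns. */
  public void blockUntilAvailable(long size) throws InterruptedException {
    while (!tryAcquire(size)) {
      Thread.sleep(1000);
    }
  }
}
{code}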



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-11-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230115#comment-17230115
 ] 

Hudson commented on HBASE-24779:


Results for branch branch-2.3
[build #102 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/102/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/102/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/102/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/102/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/102/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}




[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-09-11 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194602#comment-17194602
 ] 

Hudson commented on HBASE-24779:


Results for branch branch-2.2
[build #58 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/58/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/58//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/58//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.2/58//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}




[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-08-03 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170496#comment-17170496
 ] 

Josh Elser commented on HBASE-24779:


{noformat}
2020-08-03 22:01:53,065 INFO [mizar:16020Replication Statistics #0] 
regionserver.Replication(271): Global stats: WAL Edits Buffer Used=684288B, 
Limit=268435456B
Normal source for cluster 1: Total replicated edits: 46200, current progress: 
walGroup [mizar.local%2C16020%2C1596506315322]: currently replicating from: 
hdfs://mizar.local:8020/hbase-3/WALs/mizar.local,16020,1596506315322/mizar.local%2C16020%2C1596506315322.1596506319356
 at position: 4292571
{noformat}
{noformat}
{
  "name": "Hadoop:service=HBase,name=RegionServer,sub=Replication",
  "modelerType": "RegionServer,sub=Replication",
  "tag.Context": "regionserver",
  "tag.Hostname": "mizar.local",
...
  "source.walReaderEditsBufferUsage": 684288,
...
}{noformat}
Some examples of the updated output, captured with a sleep-and-do-nothing 
replication endpoint. I'll put up a PR shortly and include this on GH too.
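As a possible follow-on for operators, a small poller could watch the new {{source.walReaderEditsBufferUsage}} value through the RegionServer's JMX servlet instead of grepping logs. The sketch below is illustrative only and is not part of the patch; the hostname and the default info port (16030) are assumptions to adjust for your cluster.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Illustrative poller for the replication metrics bean shown in the JSON above. */
public class WalReaderBufferUsagePoller {
  public static void main(String[] args) throws Exception {
    // Assumes the RegionServer info server is reachable on the default port 16030.
    String url = "http://mizar.local:16030/jmx"
        + "?qry=Hadoop:service=HBase,name=RegionServer,sub=Replication";

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
    String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

    // Crude extraction; a real tool would parse the JSON properly.
    for (String line : body.split("\n")) {
      if (line.contains("source.walReaderEditsBufferUsage")) {
        System.out.println(line.trim());
      }
    }
  }
}
{code}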



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-31 Thread Wellington Chevreuil (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168909#comment-17168909
 ] 

Wellington Chevreuil commented on HBASE-24779:
--

I've opened HBASE-24807 for the branch-1 backport.



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-31 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168890#comment-17168890
 ] 

Josh Elser commented on HBASE-24779:


{quote}I did some further investigation and found HBASE-20417, which fixed the 
problem of the WAL reader running even with the peer disabled, from 2.1.0 
onwards.
{quote}
Great! Thanks for searching and finding that. I didn't look hard enough, it 
seems :)
{quote}So we should backport HBASE-20417 to branch-1 as well.
{quote}
+1



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-31 Thread Wellington Chevreuil (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168660#comment-17168660
 ] 

Wellington Chevreuil commented on HBASE-24779:
--

I did some further investigation and found HBASE-20417, which fixed the 
problem of the WAL reader running even with the peer disabled, from 2.1.0 
onwards.



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-29 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167484#comment-17167484
 ] 

Josh Elser commented on HBASE-24779:


Oh, crap. Right you are, [~bharathv]. I said that based on the 1.x versions I 
had been familiar with :)



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-29 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167479#comment-17167479
 ] 

Bharath Vissapragada commented on HBASE-24779:
--

Ouch, nice find. Perhaps we can expose the current buffer usage as a metric for 
observability? Also curious why this is 2.x only, seems HBASE-15995 is also in 
branch-1?



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-29 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167456#comment-17167456
 ] 

Josh Elser commented on HBASE-24779:


Thanks to Wellington – this one might be an HBase 2.x-only issue. Seems to have 
come about as a result of changes in HBASE-15995.



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-29 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167450#comment-17167450
 ] 

Josh Elser commented on HBASE-24779:


The fun continues for me. My current assumption is that we can actually hit 
this with a disabled peer. We don't prevent the WAL reader threads from running 
for disabled peers. This means that, with enough backlog of data, a 
RegionServer could saturate the 256MB default limit and block the non-disabled 
replication queues from making progress (because we won't read any more from 
the WALs for those non-disabled peers). See the config sketch below for where 
that limit comes from.

Just mentioning it here, should someone else come along later: the plan is to 
spin out a second Jira issue to confirm this can happen (e.g. via a unit test) 
and fix it. This issue is specifically limited to reporting this replication 
quota to operators.
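For reference, a hedged note on where that 256MB figure comes from: as far as I can tell it is the {{replication.total.buffer.quota}} setting, whose default (268435456 bytes) matches the Limit shown in the log output above. The sketch below only reads the effective value; verify the property name against your HBase version before relying on it.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

/** Illustrative check of the global replication edits-buffer limit in the local config. */
public class ReplicationBufferQuotaCheck {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Property name assumed here; the default is 256MB, i.e. the 268435456B Limit in the log above.
    long limit = conf.getLong("replication.total.buffer.quota", 256L * 1024 * 1024);
    System.out.println("replication.total.buffer.quota = " + limit + " bytes");
  }
}
{code}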



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-27 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165919#comment-17165919
 ] 

Josh Elser commented on HBASE-24779:


{quote}To confirm this suspicion: has a restart of such stuck RSes allowed 
replication to run smoothly again?
{quote}
Yep.
{quote}did this use existing tooling? or something we could improve based off 
of e.g. the canary tool? at a minimum sounds like a good docs addition for the 
troubleshooting stuff in operator-tools or the ref guide
{quote}
Manual creation of a table and `assign`'ing it to RegionServers. We were not 
able to validate whether the "bad" / "good" RegionServers actually matched the 
{{checkQuota()}} stack trace as I would've hoped.
{quote}
hbase 1-ish replication, hbase 2-ish replication, or hbase master-branch-ish 
replication?
{quote}
This user was actually on 2-ish, but going off of git-log, it seems like this 
logic has been here for some time. Surprising, because I never ran into it 
before. Makes me curious how the heck it even happened, and makes me suspicious 
of my analysis to date.
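For anyone who would rather script that probe than do it by hand, here is a rough sketch using the HBase 2.x Java client: create a small table with a replicated column family, write a marker row, then check the peer cluster for it. It is illustrative only; the table and family names are made up, and pinning the probe region to a particular RegionServer would still be done with {{move}}/{{assign}} from the shell.

{code:java}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

/** Illustrative replication probe: writes one replicated row for an operator to look for on the peer. */
public class ReplicationProbe {
  public static void main(String[] args) throws Exception {
    TableName probe = TableName.valueOf("replication_probe");
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // REPLICATION_SCOPE = 1 marks the family for replication to the configured peers.
      TableDescriptor desc = TableDescriptorBuilder.newBuilder(probe)
          .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f"))
              .setScope(1).build())
          .build();
      admin.createTable(desc);

      try (Table table = conn.getTable(probe)) {
        Put put = new Put(Bytes.toBytes("probe-row"));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);
      }
      // Then check on the peer whether "probe-row" arrived; never arriving suggests a "bad" RS.
    }
  }
}
{code}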



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-27 Thread Sean Busbey (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165885#comment-17165885
 ] 

Sean Busbey commented on HBASE-24779:
-

{quote}Helped a customer this past weekend who, with a large number of 
RegionServers, had some RegionServers which replicated data to a peer without 
issue while other RegionServers did not.{quote}

hbase 1-ish replication, hbase 2-ish replication, or hbase master-branch-ish 
replication?



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-27 Thread Sean Busbey (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165883#comment-17165883
 ] 

Sean Busbey commented on HBASE-24779:
-

{quote}
We were able to validate that there were "good" and "bad" RegionServers by 
creating a test table, assigning it to a RegionServer, enabling replication on 
that table, and validating whether the local puts were replicated to a peer. On 
a good RS, data was replicated immediately. On a bad RS, data was never 
replicated (at least, not within the 10's of minutes that we waited).
{quote}

did this use existing tooling? or something we could improve based off of e.g. 
the canary tool? at a minimum sounds like a good docs addition for the 
troubleshooting stuff in operator-tools or the ref guide



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-27 Thread Wellington Chevreuil (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165821#comment-17165821
 ] 

Wellington Chevreuil commented on HBASE-24779:
--

Interesting. To confirm this suspicion: has a restart of such stuck RSes 
allowed replication to run smoothly again?



[jira] [Commented] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-27 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165815#comment-17165815
 ] 

Josh Elser commented on HBASE-24779:


{noformat}
"regionserver/my-regionserver:16020.replicationSource.my-regionserver.domain.com%2C16020%2C1595762757628,.replicationSource.wal-reader.my-regionserver.domain.com%2C16020%2C1595762757628,"
 #531 daemon prio=5 os_prio=0 tid=0x7f0680d21000 nid=0x2c0eb 
sleeping[0x7ef3bf827000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:148)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.checkQuota(ReplicationSourceWALReader.java:227)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:130)

   Locked ownable synchronizers:
- None
 {noformat}
For example, this is one of the {{wal-reader}} threads which is "hung" in a 
loop, never actually reading entries off of the WALs. You may see one or more 
of these (many, if this RS is also reading recovered queues in addition to its 
active WAL).
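If you have a saved {{jstack}} dump from a suspect RegionServer, something like the following will pull out just these blocked readers. It is a throwaway, illustrative filter (the dump file path is taken as the first argument), not part of this change.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustrative filter: print wal-reader thread blocks from a saved jstack dump
 *  that are sitting in ReplicationSourceWALReader.checkQuota. */
public class StuckWalReaderFinder {
  public static void main(String[] args) throws IOException {
    String dump = new String(Files.readAllBytes(Paths.get(args[0])));
    // jstack separates thread entries with blank lines.
    for (String block : dump.split("\n\\s*\n")) {
      if (block.contains(".replicationSource.wal-reader.")
          && block.contains("ReplicationSourceWALReader.checkQuota")) {
        System.out.println(block);
        System.out.println();
      }
    }
  }
}
{code}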




--
This message was sent by Atlassian Jira
(v8.3.4#803005)