[jira] [Updated] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-08-07 - Josh Elser (Jira)


 [ https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser updated HBASE-24779:
---
Hadoop Flags: Reviewed
Release Note: New metrics are exposed on the global replication source which 
indicate the usage of the "WAL entry buffer" introduced in HBASE-15995. When 
this usage reaches the limit, that RegionServer will stop reading more WAL data 
for replication. This usage (and its limit) is local to each RegionServer and 
is shared across all peers handled by that RegionServer.
  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

Thanks Busbey, Wellington, and Bharath for the reviews on GitHub!
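
As a rough illustration of how an operator might read the new global-source 
metrics, the sketch below connects to a RegionServer over JMX and dumps any 
replication attributes whose names mention the buffer. The host, port, MBean 
name, and attribute filter are assumptions for the example, not something this 
issue defines; adjust them to what your RegionServer actually exposes.

{code:java}
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/**
 * Rough sketch: dump buffer-related replication metrics from a RegionServer over JMX.
 * The host, port, MBean name, and attribute filter here are illustrative assumptions.
 */
public class ReplicationBufferMetricPeek {
  public static void main(String[] args) throws Exception {
    // Assumes JMX is enabled on the RegionServer at this (hypothetical) host/port.
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://regionserver.example.com:10102/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection mbsc = connector.getMBeanServerConnection();
      // Replication source metrics are commonly published under this bean name;
      // verify the name against your HBase version.
      ObjectName replication = new ObjectName(
          "Hadoop:service=HBase,name=RegionServer,sub=Replication");
      for (MBeanAttributeInfo attr : mbsc.getMBeanInfo(replication).getAttributes()) {
        String name = attr.getName();
        // Print anything that looks like the WAL entry buffer usage or quota.
        if (name.toLowerCase().contains("buffer")) {
          System.out.println(name + " = " + mbsc.getAttribute(replication, name));
        }
      }
    }
  }
}
{code}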

> Improve insight into replication WAL readers hung on checkQuota
> ---
>
> Key: HBASE-24779
> URL: https://issues.apache.org/jira/browse/HBASE-24779
> Project: HBase
>  Issue Type: Task
>  Components: Replication
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Minor
> Fix For: 3.0.0-alpha-1, 2.4.0
>
>
> Helped a customer this past weekend who, with a large number of 
> RegionServers, had some RegionServers replicating data to a peer without 
> issue while other RegionServers did not.
> The number of queued logs varied over the past 24 hours in the same manner: 
> some spikes into the hundreds of queued logs, while at other times only a 
> handful of logs were queued.
> We were able to validate that there were "good" and "bad" RegionServers by 
> creating a test table, assigning it to a RegionServer, enabling replication 
> on that table, and checking whether the local puts were replicated to a peer. On 
> a good RS, data was replicated immediately. On a bad RS, data was never 
> replicated (at least not within the tens of minutes that we waited).
> On a "bad" RS, we observed that the {{wal-reader}} thread(s) 
> on that RS were spending time in a Thread.sleep() in a different location 
> than on the good RegionServers. Specifically, they were sitting in the 
> sleep call inside {{ReplicationSourceWALReader#checkQuota()}}, _not_ in the 
> {{handleEmptyWALBatch()}} method on the same class.
> My only assumption is that, somehow, these RegionServers got into a situation 
> where they "allocated" memory from the quota but never freed it. Then, 
> because the WAL reader thinks it has no free memory, it blocks indefinitely, 
> and there are no pending edits to ship and (thus) free that memory. A cursory 
> glance at the code gives me a _lot_ of anxiety around places where we don't 
> properly clean it up (e.g. batches that fail to ship, dropping a peer). As a 
> first stab, let me add some more debugging so we can actually track this 
> state properly for the operators and their sanity.
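
For readers unfamiliar with the buffer-quota pattern described above, here is a 
minimal, illustrative sketch of the kind of accounting that can hang a reader 
when a release is skipped. The class and method names are made up for this 
example and are not the actual HBase internals.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch only: a shared byte-count quota like the "WAL entry buffer".
 * Names are hypothetical; this is not the real ReplicationSourceWALReader code.
 */
public class WalEntryBufferQuota {
  // Shared across all replication sources (peers) on this RegionServer.
  private final AtomicLong bufferUsed = new AtomicLong();
  private final long quotaBytes;

  public WalEntryBufferQuota(long quotaBytes) {
    this.quotaBytes = quotaBytes;
  }

  /** Called by a WAL reader before buffering another batch of edits. */
  public void acquire(long batchSizeBytes) throws InterruptedException {
    // Sleep while the shared buffer is over quota -- analogous to the
    // sleep observed inside checkQuota() on the "bad" RegionServers.
    while (bufferUsed.get() + batchSizeBytes > quotaBytes) {
      Thread.sleep(1000);
    }
    bufferUsed.addAndGet(batchSizeBytes);
  }

  /**
   * Must be called once the batch is shipped or dropped. If any failure path
   * (a batch that fails to ship, a peer that is removed) skips this release,
   * bufferUsed never drains and acquire() blocks forever -- the suspected hang.
   */
  public void release(long batchSizeBytes) {
    bufferUsed.addAndGet(-batchSizeBytes);
  }
}
{code}

Exposing the usage counter as a metric, as this issue does for the real buffer, 
lets operators spot a counter that only ever grows.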



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-08-07 - Josh Elser (Jira)


 [ https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser updated HBASE-24779:
---
Fix Version/s: 2.4.0, 3.0.0-alpha-1



[jira] [Updated] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-08-03 - Josh Elser (Jira)


 [ https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser updated HBASE-24779:
---
Status: Patch Available  (was: Open)



[jira] [Updated] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota

2020-07-27 - Sean Busbey (Jira)


 [ https://issues.apache.org/jira/browse/HBASE-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HBASE-24779:

Component/s: Replication
