[jira] [Updated] (HBASE-20851) Change rubocop config for max line length of 100

2019-10-07 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-20851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-20851:
---
Fix Version/s: (was: 1.5.0)

> Change rubocop config for max line length of 100
> 
>
> Key: HBASE-20851
> URL: https://issues.apache.org/jira/browse/HBASE-20851
> Project: HBase
>  Issue Type: Bug
>  Components: community, shell
>Affects Versions: 2.0.1
>Reporter: Umesh Agashe
>Assignee: Murtaza Hassan
>Priority: Minor
>  Labels: beginner, beginners
> Fix For: 3.0.0, 2.2.0, 1.4.10, 2.3.0, 2.0.6, 2.1.5, 1.3.5
>
>
> Existing Ruby and Java code uses a max line length of 100 characters. Change 
> the rubocop config with:
> {code:java}
> Metrics/LineLength:
>   Max: 100
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-20672) New metrics ReadRequestRate and WriteRequestRate

2019-10-07 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-20672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-20672:
---
Fix Version/s: (was: 1.5.0)

> New metrics ReadRequestRate and WriteRequestRate
> 
>
> Key: HBASE-20672
> URL: https://issues.apache.org/jira/browse/HBASE-20672
> Project: HBase
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Ankit Jain
>Assignee: Ankit Jain
>Priority: Minor
> Fix For: 3.0.0, 1.3.3, 2.2.0, 1.4.10
>
> Attachments: HBASE-20672.branch-1.001.patch, 
> HBASE-20672.branch-1.002.patch, HBASE-20672.branch-2.001.patch, 
> HBASE-20672.master.001.patch, HBASE-20672.master.002.patch, 
> HBASE-20672.master.003.patch, hits1vs2.4.40.400.png
>
>
> HBase currently provides counters for read/write requests (ReadRequestCount, 
> WriteRequestCount). However, it is not easy to work with counters that reset 
> only after a restart of the service, so we would like to expose two new 
> metrics, ReadRequestRate and WriteRequestRate, at the region server level.
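
As a rough illustration of deriving a rate from a monotonically increasing 
counter such as ReadRequestCount, here is a minimal sampling sketch; the class 
and method names are hypothetical and this is not the HBASE-20672 patch.

{code:java}
// Illustrative sketch only: derive a per-second rate from a monotonically
// increasing counter by sampling it periodically. All names are hypothetical.
public class RequestRateSampler {
  private long lastCount;
  private long lastTimeMs;

  public RequestRateSampler(long initialCount) {
    this.lastCount = initialCount;
    this.lastTimeMs = System.currentTimeMillis();
  }

  /** Returns requests per second observed since the previous sample. */
  public synchronized double sample(long currentCount) {
    long now = System.currentTimeMillis();
    long deltaCount = currentCount - lastCount;
    long deltaMs = Math.max(1L, now - lastTimeMs);
    lastCount = currentCount;
    lastTimeMs = now;
    return (deltaCount * 1000.0) / deltaMs;
  }
}
{code}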



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-22215) Backport MultiRowRangeFilter does not work with reverse scans

2019-10-07 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22215:
---
Fix Version/s: (was: 1.5.0)

> Backport MultiRowRangeFilter does not work with reverse scans
> -
>
> Key: HBASE-22215
> URL: https://issues.apache.org/jira/browse/HBASE-22215
> Project: HBase
>  Issue Type: Sub-task
>  Components: Filters
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Fix For: 1.4.10
>
> Attachments: HBASE-22215.001.branch-1.patch, HBASE-22215.001.patch, 
> HBASE-22215.branch-1.4.001.patch
>
>
> See parent. Modify and apply to 1.x lines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-22219) Backport HBASE-19049 to branch-1 to prevent DIRKRB-613

2019-10-07 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22219:
---
Fix Version/s: (was: 1.5.0)

> Backport HBASE-19049 to branch-1 to prevent DIRKRB-613 
> ---
>
> Key: HBASE-22219
> URL: https://issues.apache.org/jira/browse/HBASE-22219
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 1.4.8, 1.4.7, 1.4.9
>Reporter: Yu Li
>Assignee: Yu Li
>Priority: Major
> Fix For: 1.4.10
>
> Attachments: HBASE-22219.branch-1.patch
>
>
> Observed several UT failures when verifying 1.5.0 release, one of which is as 
> follows:
> {noformat}
> [ERROR] org.apache.hadoop.hbase.http.TestSpnegoHttpServer  Time elapsed: 
> 0.005 s  <<< ERROR!
> java.lang.RuntimeException: Unable to parse:includedir /etc/krb5.conf.d/
> at 
> org.apache.hadoop.hbase.http.TestSpnegoHttpServer.buildMiniKdc(TestSpnegoHttpServer.java:143)
> at 
> org.apache.hadoop.hbase.http.TestSpnegoHttpServer.setupServer(TestSpnegoHttpServer.java:91)
> {noformat}
> This is a known issue of kerby 1.0.0-RC2, as indicated by DIRKRB-613, and it 
> was fixed in the 1.0.0 release (by [this 
> commit|https://github.com/apache/directory-kerby/commit/e34b1ef8fec64e89780aec37aac903d4608e215f]).
> Thus we should backport HBASE-19049, which upgrades the kerby dependency from 
> 1.0.0-RC2 to 1.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-22212) [1.x] Backport missing filter improvements

2019-10-07 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22212:
---
Fix Version/s: (was: 1.5.0)

> [1.x] Backport missing filter improvements
> --
>
> Key: HBASE-22212
> URL: https://issues.apache.org/jira/browse/HBASE-22212
> Project: HBase
>  Issue Type: Bug
>  Components: Filters
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Fix For: 1.4.10
>
> Attachments: HBASE-22212.001.branch-1.patch, 
> HBASE-22212.002.branch-1.patch
>
>
> HBASE-19008 and HBASE-21129 were never backported beyond branch-2. I can't 
> find any reason that this was not done. Despite these being public-tagged 
> classes, no incompatible changes were added.
> The lack of these changes prevents HBASE-22144 from being backported cleanly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-10-02 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942997#comment-16942997
 ] 

Andrew Purtell edited comment on HBASE-22988 at 10/2/19 5:17 PM:
-

I thought I resolved this. I guess I just closed the PR. 

I wasn't planning to apply this further back than branch-1, but it does apply 
easily to 1.4, and to 1.3 with a minor fixup. No objection; let me do that.


was (Author: apurtell):
I thought I resolved this. I guess I just closed the PR. 

I wasn't planning to backport this but it does apply easily to 1.4, and 1.3 
with a minor fixup. No objection, let me do that.

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-10-02 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942997#comment-16942997
 ] 

Andrew Purtell commented on HBASE-22988:


I thought I resolved this. I guess I just closed the PR. 

I wasn't planning to backport this but it does apply easily to 1.4, and 1.3 
with a minor fixup. No objection, let me do that.

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-15519) Add per-user metrics

2019-09-23 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936391#comment-16936391
 ] 

Andrew Purtell commented on HBASE-15519:


Point-in-time data is suitable for the hbtop case, and even for display in the 
shell “status ‘detailed’” output, so that lines up. Thank you. I also agree the 
other issue is good for tracking the needed ClusterStatus changes. 

> Add per-user metrics 
> -
>
> Key: HBASE-15519
> URL: https://issues.apache.org/jira/browse/HBASE-15519
> Project: HBase
>  Issue Type: Sub-task
>  Components: metrics
>Affects Versions: 1.2.0
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
>Priority: Major
> Attachments: HBASE-15519.master.003.patch, hbase-15519_v0.patch, 
> hbase-15519_v1.patch, hbase-15519_v1.patch, hbase-15519_v2.patch
>
>
> Per-user metrics will be useful in multi-tenant cases where we can emit the 
> number of requests, operations, RPCs, etc. at the per-user aggregate level 
> per regionserver. We currently have throttles per user, but no way to monitor 
> resource usage per-user. 
> Looking at these metrics, operators can adjust throttles, do capacity 
> planning, etc. per-user. 
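
A minimal sketch of per-user aggregation on a region server, assuming a simple 
concurrent map of counters keyed by user name; this is illustrative only, not 
the HBASE-15519 patch.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical names; counts requests attributed to each user.
public class PerUserRequestCounts {
  private final ConcurrentMap<String, LongAdder> counts = new ConcurrentHashMap<>();

  /** Record one request attributed to the given user. */
  public void increment(String user) {
    counts.computeIfAbsent(user, u -> new LongAdder()).increment();
  }

  /** Snapshot the current totals, e.g. for emission as per-user metrics. */
  public Map<String, Long> snapshot() {
    Map<String, Long> out = new HashMap<>();
    counts.forEach((user, adder) -> out.put(user, adder.sum()));
    return out;
  }
}
{code}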



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23065) [hbtop] Top-N heavy hitter user and client drill downs

2019-09-23 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-23065:
---
Description: 
After HBASE-15519, or after an additional change on top of it that provides 
necessary data in ClusterStatus, add drill down top-N views of activity 
aggregated per user or per client IP. Only a relatively small N of the heavy 
hitters need be tracked assuming this will be most useful when one or a handful 
of users or clients is generating problematic load and hbtop is invoked to 
learn their identity. 

This is a critical missing piece. After drilling down to find hot regions or 
tables, sometimes that is not enough, we also need to know which application or 
subset of clients out of many may be the source of the hot spotting load. 

  was:
After HBASE-15519, or after an additional change on top of it that provides 
necessary data in ClusterStatus, add drill down top-N views of activity 
aggregated per user or per client IP. Only a relatively small N of the heavy 
hitters need be tracked assuming this will be most useful when one or a handful 
of users or clients is generating problematic load and hbtop is invoked to 
learn their identity. 

This is a critical missing piece. After drilling down to find hot regions or 
tables, sometimes that is not enough, we also need to know which application 
out of many may be the source of the hot spotting load. 


> [hbtop] Top-N heavy hitter user and client drill downs
> --
>
> Key: HBASE-23065
> URL: https://issues.apache.org/jira/browse/HBASE-23065
> Project: HBase
>  Issue Type: Improvement
>Reporter: Andrew Purtell
>Priority: Major
>
> After HBASE-15519, or after an additional change on top of it that provides 
> necessary data in ClusterStatus, add drill down top-N views of activity 
> aggregated per user or per client IP. Only a relatively small N of the heavy 
> hitters need be tracked assuming this will be most useful when one or a 
> handful of users or clients is generating problematic load and hbtop is 
> invoked to learn their identity. 
> This is a critical missing piece. After drilling down to find hot regions or 
> tables, sometimes that is not enough, we also need to know which application 
> or subset of clients out of many may be the source of the hot spotting load. 
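
As a rough sketch of tracking only a small N of heavy hitters, here is a 
minimal counter that aggregates activity per user (or per client IP) and 
reports the top N; the names are hypothetical and this is not hbtop code.

{code:java}
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical names; keeps per-key totals and reports the heaviest N keys.
public class TopNTracker {
  private final Map<String, Long> counts = new HashMap<>();

  /** Accumulate activity (e.g. request count) for a user or client IP. */
  public void record(String key, long delta) {
    counts.merge(key, delta, Long::sum);
  }

  /** Return the n keys with the highest accumulated counts, largest first. */
  public List<Map.Entry<String, Long>> topN(int n) {
    return counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(n)
        .collect(Collectors.toList());
  }
}
{code}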



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23065) [hbtop] Top-N heavy hitter user and client drill downs

2019-09-23 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936261#comment-16936261
 ] 

Andrew Purtell commented on HBASE-23065:


/cc [~brfrn169]  [~busbey]  [~xucang]  [~lhofhansl] 

> [hbtop] Top-N heavy hitter user and client drill downs
> --
>
> Key: HBASE-23065
> URL: https://issues.apache.org/jira/browse/HBASE-23065
> Project: HBase
>  Issue Type: Improvement
>Reporter: Andrew Purtell
>Priority: Major
>
> After HBASE-15519, or after an additional change on top of it that provides 
> necessary data in ClusterStatus, add drill down top-N views of activity 
> aggregated per user or per client IP. Only a relatively small N of the heavy 
> hitters need be tracked assuming this will be most useful when one or a 
> handful of users or clients is generating problematic load and hbtop is 
> invoked to learn their identity. 
> This is a critical missing piece. After drilling down to find hot regions or 
> tables, sometimes that is not enough, we also need to know which application 
> out of many may be the source of the hot spotting load. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23065) [hbtop] Top-N heavy hitter user and client drill downs

2019-09-23 Thread Andrew Purtell (Jira)
Andrew Purtell created HBASE-23065:
--

 Summary: [hbtop] Top-N heavy hitter user and client drill downs
 Key: HBASE-23065
 URL: https://issues.apache.org/jira/browse/HBASE-23065
 Project: HBase
  Issue Type: Improvement
Reporter: Andrew Purtell


After HBASE-15519, or after an additional change on top of it that provides 
necessary data in ClusterStatus, add drill down top-N views of activity 
aggregated per user or per client IP. Only a relatively small N of the heavy 
hitters need be tracked assuming this will be most useful when one or a handful 
of users or clients is generating problematic load and hbtop is invoked to 
learn their identity. 

This is a critical missing piece. After drilling down to find hot regions or 
tables, sometimes that is not enough, we also need to know which application 
out of many may be the source of the hot spotting load. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23061) Replace use of Jackson for JSON serde in hbase common and client modules

2019-09-20 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934890#comment-16934890
 ] 

Andrew Purtell commented on HBASE-23061:


I think that issue has been recently modified, because it wasn't clear that 
this was what was proposed, at least to those who only have mobile device 
access to JIRA. :-) Feel free to close this as a duplicate. 

> Replace use of Jackson for JSON serde in hbase common and client modules
> 
>
> Key: HBASE-23061
> URL: https://issues.apache.org/jira/browse/HBASE-23061
> Project: HBase
>  Issue Type: Bug
>Reporter: Andrew Purtell
>Priority: Blocker
> Fix For: 1.5.0
>
>
> We are using Jackson to emit JSON in at least one place in common and client. 
> We don't need all of Jackson and all the associated trouble just to do that. 
> Use a suitably licensed JSON library with no known vulnerability. This will 
> avoid problems downstream: we are trying to keep downstream users from 
> pulling in a vulnerable Jackson via us, so Jackson is now a 'provided'-scope 
> transitive dependency of client and its in-project dependencies (like common). 
> Here's where I am referring to:
> org.apache.hadoop.hbase.util.JsonMapper.(JsonMapper.java:37)
>at org.apache.hadoop.hbase.client.Operation.toJSON(Operation.java:70)
>at org.apache.hadoop.hbase.client.Operation.toString(Operation.java:96)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23061) Replace use of Jackson for JSON serde in hbase common and client modules

2019-09-20 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-23061:
---
Description: 
We are using Jackson to emit JSON in at least one place in common and client. 
We don't need all of Jackson and all the associated trouble just to do that. 
Use a suitably licensed JSON library with no known vulnerability. This will 
avoid problems downstream: we are trying to keep downstream users from pulling 
in a vulnerable Jackson via us, so Jackson is now a 'provided'-scope transitive 
dependency of client and its in-project dependencies (like common). 

Here's where I am referring to:

org.apache.hadoop.hbase.util.JsonMapper.(JsonMapper.java:37)
   at org.apache.hadoop.hbase.client.Operation.toJSON(Operation.java:70)
   at org.apache.hadoop.hbase.client.Operation.toString(Operation.java:96)

  was:
We are using Jackson to emit JSON in at least one place in common and client. 
We don't need all of Jackson and all the associated trouble just to do that. 
Use a suitably licensed JSON library with no known vulnerability. This will 
avoid problems downstream because we are trying to avoid having them pull in a 
vulnerable Jackson via us so Jackson is a provided scope. 

Here's where I am referring to:

org.apache.hadoop.hbase.util.JsonMapper.(JsonMapper.java:37)
   at org.apache.hadoop.hbase.client.Operation.toJSON(Operation.java:70)
   at org.apache.hadoop.hbase.client.Operation.toString(Operation.java:96)


> Replace use of Jackson for JSON serde in hbase common and client modules
> 
>
> Key: HBASE-23061
> URL: https://issues.apache.org/jira/browse/HBASE-23061
> Project: HBase
>  Issue Type: Bug
>Reporter: Andrew Purtell
>Priority: Blocker
> Fix For: 1.5.0
>
>
> We are using Jackson to emit JSON in at least one place in common and client. 
> We don't need all of Jackson and all the associated trouble just to do that. 
> Use a suitably licensed JSON library with no known vulnerability. This will 
> avoid problems downstream: we are trying to keep downstream users from 
> pulling in a vulnerable Jackson via us, so Jackson is now a 'provided'-scope 
> transitive dependency of client and its in-project dependencies (like common). 
> Here's where I am referring to:
> org.apache.hadoop.hbase.util.JsonMapper.(JsonMapper.java:37)
>at org.apache.hadoop.hbase.client.Operation.toJSON(Operation.java:70)
>at org.apache.hadoop.hbase.client.Operation.toString(Operation.java:96)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23061) Replace use of Jackson for JSON serde in hbase common and client modules

2019-09-20 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934874#comment-16934874
 ] 

Andrew Purtell commented on HBASE-23061:


Strongly related to HBASE-23052. This issue can cover the code changes to 
client and common where needed to use GSON from third-party.
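
For illustration only, a minimal sketch of emitting JSON with Gson rather than 
Jackson. In HBase the relocated Gson from hbase-thirdparty would be the likely 
choice; the plain com.google.gson package and the sample field values below 
are assumptions for the sketch, not the actual patch.

{code:java}
import com.google.gson.Gson;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: serialize a small map of fields to JSON without Jackson.
public class GsonJsonExample {
  public static void main(String[] args) {
    Map<String, Object> fields = new HashMap<>();
    fields.put("method", "Scan");
    fields.put("processingtimems", 28646);
    fields.put("client", "10.253.196.215:41116");
    System.out.println(new Gson().toJson(fields));
  }
}
{code}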

> Replace use of Jackson for JSON serde in hbase common and client modules
> 
>
> Key: HBASE-23061
> URL: https://issues.apache.org/jira/browse/HBASE-23061
> Project: HBase
>  Issue Type: Bug
>Reporter: Andrew Purtell
>Priority: Blocker
> Fix For: 1.5.0
>
>
> We are using Jackson to emit JSON in at least one place in common and client. 
> We don't need all of Jackson and all the associated trouble just to do that. 
> Use a suitably licensed JSON library with no known vulnerability. This will 
> avoid problems downstream: we are trying to keep downstream users from 
> pulling in a vulnerable Jackson via us, so Jackson is a provided-scope dependency. 
> Here's where I am referring to:
> org.apache.hadoop.hbase.util.JsonMapper.(JsonMapper.java:37)
>at org.apache.hadoop.hbase.client.Operation.toJSON(Operation.java:70)
>at org.apache.hadoop.hbase.client.Operation.toString(Operation.java:96)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23061) Replace use of Jackson for JSON serde in hbase common and client modules

2019-09-20 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934859#comment-16934859
 ] 

Andrew Purtell commented on HBASE-23061:


This is a blocker because in some circumstances a downstreamer will get a CNFE. 
Jackson is neither needed nor desired in common and client. Fix by replacement. 
We need to sort this out before releasing. Forward port the changes from 
branch-1 once committed. [~busbey] [~vjasani] [~lhofhansl] 

> Replace use of Jackson for JSON serde in hbase common and client modules
> 
>
> Key: HBASE-23061
> URL: https://issues.apache.org/jira/browse/HBASE-23061
> Project: HBase
>  Issue Type: Bug
>Reporter: Andrew Purtell
>Priority: Blocker
> Fix For: 1.5.0
>
>
> We are using Jackson to emit JSON in at least one place in common and client. 
> We don't need all of Jackson and all the associated trouble just to do that. 
> Use a suitably licensed JSON library with no known vulnerability. This will 
> avoid problems downstream: we are trying to keep downstream users from 
> pulling in a vulnerable Jackson via us, so Jackson is a provided-scope dependency. 
> Here's where I am referring to:
> org.apache.hadoop.hbase.util.JsonMapper.(JsonMapper.java:37)
>at org.apache.hadoop.hbase.client.Operation.toJSON(Operation.java:70)
>at org.apache.hadoop.hbase.client.Operation.toString(Operation.java:96)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23061) Replace use of Jackson for JSON serde in hbase common and client modules

2019-09-20 Thread Andrew Purtell (Jira)
Andrew Purtell created HBASE-23061:
--

 Summary: Replace use of Jackson for JSON serde in hbase common and 
client modules
 Key: HBASE-23061
 URL: https://issues.apache.org/jira/browse/HBASE-23061
 Project: HBase
  Issue Type: Bug
Reporter: Andrew Purtell
 Fix For: 1.5.0


We are using Jackson to emit JSON in at least one place in common and client. 
We don't need all of Jackson and all the associated trouble just to do that. 
Use a suitably licensed JSON library with no known vulnerability. This will 
avoid problems downstream: we are trying to keep downstream users from pulling 
in a vulnerable Jackson via us, so Jackson is a provided-scope dependency. 

Here's where I am referring to:

org.apache.hadoop.hbase.util.JsonMapper.(JsonMapper.java:37)
   at org.apache.hadoop.hbase.client.Operation.toJSON(Operation.java:70)
   at org.apache.hadoop.hbase.client.Operation.toString(Operation.java:96)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23015) branch-1 hbase-server, testing util, and shaded testing util need jackson

2019-09-20 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934836#comment-16934836
 ] 

Andrew Purtell commented on HBASE-23015:


Yes, I will try it again when I'm back if it is not already underway, but 
anyone should feel free to RM should they get the itch. Go ahead with my blessing.

> branch-1 hbase-server, testing util, and  shaded testing util need jackson
> --
>
> Key: HBASE-23015
> URL: https://issues.apache.org/jira/browse/HBASE-23015
> Project: HBase
>  Issue Type: Bug
>  Components: Client, shading
>Affects Versions: 1.5.0, 1.3.6, 1.4.11
>Reporter: Sean Busbey
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-23015.branch-1.3.000.patch, 
> HBASE-23015.branch-1.3.001.patch
>
>
> HBASE-22728 moved out jackson transitive dependencies. Mostly good, but 
> moving jackson2 to provided in hbase-server broke a few things:
> testing-util needs a transitive jackson 2 in order to start the minicluster; 
> it currently fails with a CNFE for {{com.fasterxml.jackson.databind.ObjectMapper}} 
> when trying to initialize the master.
> shaded-testing-util needs a relocated jackson 2 for the same reason.
> It's not used for any of the mapreduce stuff in hbase-server, so 
> {{hbase-shaded-server}} for that purpose should be fine. But it is used by 
> {{WALPrettyPrinter}}, and some folks might expect that to work from that 
> artifact since it is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-15519) Add per-user metrics

2019-09-20 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934257#comment-16934257
 ] 

Andrew Purtell commented on HBASE-15519:


Pardon the potentially ignorant question. I'm on vacation with just a phone. 
Does this work include exporting at least a subset of per-user accounting in 
ClusterStatus? It would be fine to do this as a follow-up, of course. The 
reason I ask is that hbtop receives its telemetry from ClusterStatus, and an 
obvious and important enhancement for that tool would be a display mode for top 
N user or top N client activity by server or region (or table or namespace 
aggregates). 

> Add per-user metrics 
> -
>
> Key: HBASE-15519
> URL: https://issues.apache.org/jira/browse/HBASE-15519
> Project: HBase
>  Issue Type: Sub-task
>  Components: metrics
>Affects Versions: 1.2.0
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
>Priority: Major
> Attachments: HBASE-15519.master.003.patch, hbase-15519_v0.patch, 
> hbase-15519_v1.patch, hbase-15519_v1.patch, hbase-15519_v2.patch
>
>
> Per-user metrics will be useful in multi-tenant cases where we can emit the 
> number of requests, operations, RPCs, etc. at the per-user aggregate level 
> per regionserver. We currently have throttles per user, but no way to monitor 
> resource usage per-user. 
> Looking at these metrics, operators can adjust throttles, do capacity 
> planning, etc. per-user. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-20 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934210#comment-16934210
 ] 

Andrew Purtell commented on HBASE-22988:


Please don't release 1.5.0 without this. It is about ready to go, just pending 
a review cycle for Java 7 port. [~lhofhansl]  [~busbey] 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23015) branch-1 hbase-server, testing util, and shaded testing util need jackson

2019-09-19 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933842#comment-16933842
 ] 

Andrew Purtell commented on HBASE-23015:


[~lhofhansl] filed a Phoenix JIRA that indicates they are blocked by this when 
trying to use the head of branch-1 for testing. I am on vacation without a dev 
laptop, unfortunately; could someone please commit this?

> branch-1 hbase-server, testing util, and  shaded testing util need jackson
> --
>
> Key: HBASE-23015
> URL: https://issues.apache.org/jira/browse/HBASE-23015
> Project: HBase
>  Issue Type: Bug
>  Components: Client, shading
>Affects Versions: 1.5.0, 1.3.6, 1.4.11
>Reporter: Sean Busbey
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-23015.branch-1.3.000.patch
>
>
> HBASE-22728 moved out jackson transitive dependencies. Mostly good, but 
> moving jackson2 to provided in hbase-server broke a few things:
> testing-util needs a transitive jackson 2 in order to start the minicluster; 
> it currently fails with a CNFE for {{com.fasterxml.jackson.databind.ObjectMapper}} 
> when trying to initialize the master.
> shaded-testing-util needs a relocated jackson 2 for the same reason.
> It's not used for any of the mapreduce stuff in hbase-server, so 
> {{hbase-shaded-server}} for that purpose should be fine. But it is used by 
> {{WALPrettyPrinter}}, and some folks might expect that to work from that 
> artifact since it is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-21856) Consider Causal Replication Ordering

2019-09-19 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933196#comment-16933196
 ] 

Andrew Purtell edited comment on HBASE-21856 at 9/19/19 9:23 AM:
-

The sender is going to be providing batches in sequence to each consistent 
target. These batches may arrive out of order, but the window will be small by 
definition, because the ordering can only be indeterminate while RPCs are in flight. 

But if we are having the issues described, we could introduce a back-pressure 
signal or retransmit indication that stops the sender, causes it to 
(exponentially) back off, or causes it to resend a specific batch. This could 
be explored, but only if actually needed.


was (Author: apurtell):
That's correct. But if we are having issues as described if we introduce a back 
pressure signal that stops the sender, or causes it to (exponentially) back 
off, this will be fine. 
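
A minimal sketch of the back-pressure / exponential back-off idea discussed in 
this comment; the class and method names are hypothetical and this is not 
replication code.

{code:java}
// Illustrative only: back off exponentially (with a cap) while the sink
// signals it cannot keep up, and reset once a batch is accepted.
public class BackoffPolicy {
  private final long baseMs;
  private final long maxMs;
  private long currentMs;

  public BackoffPolicy(long baseMs, long maxMs) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
    this.currentMs = baseMs;
  }

  /** Called when the sink rejects or delays a batch; returns how long to wait. */
  public long onBackpressure() {
    long wait = currentMs;
    currentMs = Math.min(maxMs, currentMs * 2);
    return wait;
  }

  /** Called when a batch is accepted; resets the backoff window. */
  public void onSuccess() {
    currentMs = baseMs;
  }
}
{code}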

> Consider Causal Replication Ordering
> 
>
> Key: HBASE-21856
> URL: https://issues.apache.org/jira/browse/HBASE-21856
> Project: HBase
>  Issue Type: Brainstorming
>  Components: Replication
>Reporter: Lars Hofhansl
>Priority: Major
>  Labels: Replication
>
> We've had various efforts to improve the ordering guarantees for HBase 
> replication, most notably Serial Replication.
> I think in many cases guaranteeing a Total Replication Order is not required, 
> but a simpler Causal Replication Order is sufficient.
> Specifically we would guarantee causal ordering for a single Rowkey. Any 
> changes to a Row - Puts, Deletes, etc - would be replicated in the exact 
> order in which they occurred in the source system.
> Unlike total ordering this can be accomplished with only local region server 
> control.
> I don't have a full design in mind, let's discuss here. It should be 
> sufficient to do the following:
> # RegionServers only adopt the replication queues from other RegionServers 
> for regions they (now) own. This requires log splitting for replication.
> # RegionServers ship all edits for queues adopted from other servers before 
> any of their "own" edits are shipped.
> It's probably a bit more involved, but should be much cheaper than the total 
> ordering provided by serial replication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-21856) Consider Causal Replication Ordering

2019-09-19 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933196#comment-16933196
 ] 

Andrew Purtell commented on HBASE-21856:


That's correct. But if we are having the issues described, and we introduce a 
back-pressure signal that stops the sender, or causes it to (exponentially) 
back off, this will be fine. 

> Consider Causal Replication Ordering
> 
>
> Key: HBASE-21856
> URL: https://issues.apache.org/jira/browse/HBASE-21856
> Project: HBase
>  Issue Type: Brainstorming
>  Components: Replication
>Reporter: Lars Hofhansl
>Priority: Major
>  Labels: Replication
>
> We've had various efforts to improve the ordering guarantees for HBase 
> replication, most notably Serial Replication.
> I think in many cases guaranteeing a Total Replication Order is not required, 
> but a simpler Causal Replication Order is sufficient.
> Specifically we would guarantee causal ordering for a single Rowkey. Any 
> changes to a Row - Puts, Deletes, etc - would be replicated in the exact 
> order in which they occurred in the source system.
> Unlike total ordering this can be accomplished with only local region server 
> control.
> I don't have a full design in mind, let's discuss here. It should be 
> sufficient to do the following:
> # RegionServers only adopt the replication queues from other RegionServers 
> for regions they (now) own. This requires log splitting for replication.
> # RegionServers ship all edits for queues adopted from other servers before 
> any of their "own" edits are shipped.
> It's probably a bit more involved, but should be much cheaper than the total 
> ordering provided by serial replication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-18 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932179#comment-16932179
 ] 

Andrew Purtell commented on HBASE-22978:


If need be, but it seems premature because the work hasn’t even started yet. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 

[jira] [Assigned] (HBASE-22978) Online slow response log

2019-09-18 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reassigned HBASE-22978:
--

Assignee: Andrew Purtell

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 
> a user name, will include only entries pertaining to that user. Filters are 

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-16 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930926#comment-16930926
 ] 

Andrew Purtell commented on HBASE-22978:


Just to be clear: if the HDFS file option is turned on then all history will be 
there, even though it may be expensive to keep all the events (large file sizes 
even with compression, depending). The shell cannot be expected to process large 
data efficiently. Some kind of tool that can optionally use MR will be 
developed. 
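
The issue description proposes applying fast compression to ring buffer entries 
and decompressing on the fly when servicing retrieval. As a rough illustration 
of such a round trip, here is a sketch using java.util.zip's Deflate codec; the 
proposal mentions snappy or zstandard where codec support is available, so the 
codec choice here is only an assumption for the example.

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: compress an entry before storing it, decompress on read.
public class EntryCompression {
  static byte[] compress(String entry) {
    byte[] input = entry.getBytes(StandardCharsets.UTF_8);
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  static String decompress(byte[] compressed) throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!inflater.finished()) {
      out.write(buf, 0, inflater.inflate(buf));
    }
    inflater.end();
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}
{code}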

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only 

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-16 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930921#comment-16930921
 ] 

Andrew Purtell commented on HBASE-22978:


The shell reads from ring buffers. This is an online slow request log. The goal 
isn’t to have a complete history; it is to keep a list of interesting events for 
as long as we can without consuming too many resources, either RAM or files. 

Having an option to also dump to HDFS is important, so we added that as a side 
goal. The tool that dumps the files will have the same capabilities as the 
shell because I expect to share common utility classes. However, this is not the 
main goal. The main goal is shell access to ring buffers containing the *latest* 
events of interest, for use in online and active debugging. 
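
As a rough sketch of the fixed-capacity ring buffer these comments and the 
issue description refer to (the oldest entries are overwritten once the buffer 
is full), assuming a plain string record for simplicity; this is illustrative 
only, not the feature implementation.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: fixed-capacity buffer of the latest slow-request records.
public class SlowRequestRingBuffer {
  private final String[] records;
  private int next = 0;
  private int size = 0;

  public SlowRequestRingBuffer(int capacity) {
    this.records = new String[capacity];
  }

  /** Add a record, overwriting the oldest entry once the buffer is full. */
  public synchronized void add(String record) {
    records[next] = record;
    next = (next + 1) % records.length;
    size = Math.min(size + 1, records.length);
  }

  /** Snapshot current contents, oldest first, e.g. for the retrieval API. */
  public synchronized List<String> snapshot() {
    List<String> out = new ArrayList<>(size);
    int start = (next - size + records.length) % records.length;
    for (int i = 0; i < size; i++) {
      out.add(records[(start + i) % records.length]);
    }
    return out;
  }
}
{code}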

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of 

[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-11 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928068#comment-16928068
 ] 

Andrew Purtell commented on HBASE-22988:


Not sure your addendum changes are included here. I used your github branch as 
source. 

bq. Do we need to convert the patch for Java 7? If so, I can do it.

It would make adoption easier. Right now we must conditionally include the 
module only if building with Java 8, and there are issues with precommit for 
this configuration. If it were all Java 7 code then we could include the module 
unconditionally. The issues with the unit tests would also be resolved. Let’s 
do this if you have the time and interest. 

I will be back from vacation in two weeks. Happy to review or help out when 
back if needed. 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-23000) Fix all consistently failing tests in branch-1.3

2019-09-10 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927102#comment-16927102
 ] 

Andrew Purtell commented on HBASE-23000:


Thanks. Let’s patch branch-1.3 to fix this. 

> Fix all consistently failing tests in branch-1.3
> 
>
> Key: HBASE-23000
> URL: https://issues.apache.org/jira/browse/HBASE-23000
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.3.6
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>Priority: Major
>
> Flaky test report: 
> https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-1.3/Flaky_20Test_20Report/dashboard.html#job_2
> In the last 30 builds, this test failed all 30 times.
> Here is the stack trace: 
> {noformat}
> Stacktrace
> java.io.IOException: Shutting down
>   at 
> org.apache.hadoop.hbase.fs.TestBlockReorder.testBlockLocation(TestBlockReorder.java:428)
> Caused by: java.lang.RuntimeException: Master not initialized after 20ms 
> seconds
>   at 
> org.apache.hadoop.hbase.fs.TestBlockReorder.testBlockLocation(TestBlockReorder.java:428)
> {noformat}
> Link to latest jenkins build: 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-1.3/9351/testReport/org.apache.hadoop.hbase.fs/TestBlockReorder/testBlockLocation/



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-21856) Consider Causal Replication Ordering

2019-09-10 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926815#comment-16926815
 ] 

Andrew Purtell commented on HBASE-21856:


The *_source_* must ship all of a region's edits to the same sink regionserver. 
That is the narrowest requirement on the source side. All edits for a region must flow to 
the same sink, presumably selected with a consistent hash. The *_sink_* 
regionserver must then apply the batches in order. It is not strictly necessary 
for the source to send batches sequentially or in a blocking manner. We might 
opt to implement it that way, but it is not necessary. IMHO, better to do the 
mastering and ordering on the sink side, but I agree it may make the 
implementation more challenging. 
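
A minimal sketch of the "same sink, presumably selected with a consistent hash" 
idea: a consistent-hash ring that maps every edit for a given region to one 
sink region server, so that sink can apply the batches in order. The class 
name, the CRC32 hash, and the virtual-node count are assumptions made for 
illustration; this is not replication code.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Illustrative only: all edits for the same region map to the same sink.
public class ConsistentSinkSelector {
  private final TreeMap<Long, String> ring = new TreeMap<>();

  public ConsistentSinkSelector(List<String> sinkServers, int virtualNodes) {
    for (String server : sinkServers) {
      for (int i = 0; i < virtualNodes; i++) {
        ring.put(hash(server + "#" + i), server);
      }
    }
  }

  /** Pick the sink for a region; stable as long as the sink set is unchanged. */
  public String sinkFor(String encodedRegionName) {
    long h = hash(encodedRegionName);
    SortedMap<Long, String> tail = ring.tailMap(h);
    return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
  }

  private static long hash(String s) {
    CRC32 crc = new CRC32();
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }
}
{code}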

> Consider Causal Replication Ordering
> 
>
> Key: HBASE-21856
> URL: https://issues.apache.org/jira/browse/HBASE-21856
> Project: HBase
>  Issue Type: Brainstorming
>  Components: Replication
>Reporter: Lars Hofhansl
>Priority: Major
>  Labels: Replication
>
> We've had various efforts to improve the ordering guarantees for HBase 
> replication, most notably Serial Replication.
> I think in many cases guaranteeing a Total Replication Order is not required, 
> but a simpler Causal Replication Order is sufficient.
> Specifically we would guarantee causal ordering for a single Rowkey. Any 
> changes to a Row - Puts, Deletes, etc - would be replicated in the exact 
> order in which they occurred in the source system.
> Unlike total ordering this can be accomplished with only local region server 
> control.
> I don't have a full design in mind, let's discuss here. It should be 
> sufficient to do the following:
> # RegionServers only adopt the replication queues from other RegionServers 
> for regions they (now) own. This requires log splitting for replication.
> # RegionServers ship all edits for queues adopted from other servers before 
> any of their "own" edits are shipped.
> It's probably a bit more involved, but should be much cheaper that the total 
> ordering provided by serial replication.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22760) Stop/Resume Snapshot Auto-Cleanup activity with shell command

2019-09-10 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926812#comment-16926812
 ] 

Andrew Purtell commented on HBASE-22760:


+1


> Stop/Resume Snapshot Auto-Cleanup activity with shell command
> -
>
> Key: HBASE-22760
> URL: https://issues.apache.org/jira/browse/HBASE-22760
> Project: HBase
>  Issue Type: Improvement
>  Components: Admin, shell, snapshots
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0
>
> Attachments: HBASE-22760.branch-1.000.patch, 
> HBASE-22760.branch-1.001.patch, HBASE-22760.branch-2.000.patch, 
> HBASE-22760.master.003.patch, HBASE-22760.master.004.patch, 
> HBASE-22760.master.005.patch, HBASE-22760.master.008.patch, 
> HBASE-22760.master.009.patch
>
>
> For any scheduled snapshot backup activity, we would like to disable 
> auto-cleaner for snapshot based on TTL. However, as per HBASE-22648 we have a 
> config to disable snapshot auto-cleaner: 
> hbase.master.cleaner.snapshot.disable, which would take effect only upon 
> HMaster restart just similar to any other hbase-site configs.
> For any running cluster, we should be able to stop/resume auto-cleanup 
> activity for snapshot based on shell command. Something similar to below 
> command should be able to stop/start cleanup chore:
> hbase(main):001:0> snapshot_auto_cleanup_switch false    (disable 
> auto-cleaner)
> hbase(main):001:0> snapshot_auto_cleanup_switch true     (enable auto-cleaner)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22460) Reopen a region if store reader references may have leaked

2019-09-10 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926810#comment-16926810
 ] 

Andrew Purtell commented on HBASE-22460:


I'm going out on vacation for two weeks. [~vjasani] you'll probably get more 
review interaction if you open a GitHub PR with your master patch. Refer to 
this JIRA in the PR and everything will link up. 

> Reopen a region if store reader references may have leaked
> --
>
> Key: HBASE-22460
> URL: https://issues.apache.org/jira/browse/HBASE-22460
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: Andrew Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.3.0
>
> Attachments: HBASE-22460.master.000.patch
>
>
> We can leak store reader references if a coprocessor or core function somehow 
> opens a scanner, or wraps one, and then does not take care to call close on 
> the scanner or the wrapped instance. A reasonable mitigation for a reader 
> reference leak would be a fast reopen of the region on the same server 
> (initiated by the RS) This will release all resources, like the refcount, 
> leases, etc. The clients should gracefully ride over this like any other 
> region transition. This reopen would be like what is done during schema 
> change application and ideally would reuse the relevant code. If the refcount 
> is over some ridiculous threshold this mitigation could be triggered along 
> with a fat WARN in the logs. 
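A minimal sketch of the mitigation described above (illustrative only; the Region interface, storeReaderRefCount(), and reopenOnSameServer() are hypothetical stand-ins, not actual HBase APIs):

{code:java}
// Hypothetical sketch: reopen a region when its store reader refcount looks leaked.
public class ReaderLeakMitigation {
  interface Region {
    long storeReaderRefCount();   // hypothetical accessor for the aggregate refcount
    void reopenOnSameServer();    // hypothetical fast reopen, as done for schema changes
  }

  private final long refCountThreshold;

  public ReaderLeakMitigation(long refCountThreshold) {
    this.refCountThreshold = refCountThreshold;
  }

  /** Called periodically, e.g. by a chore; reopens the region if references appear leaked. */
  public void check(Region region) {
    long refCount = region.storeReaderRefCount();
    if (refCount > refCountThreshold) {
      System.err.println("WARN: store reader refCount=" + refCount
          + " exceeds threshold=" + refCountThreshold + ", requesting region reopen");
      region.reopenOnSameServer();
    }
  }
}
{code}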



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22460) Reopen a region if store reader references may have leaked

2019-09-10 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926811#comment-16926811
 ] 

Andrew Purtell commented on HBASE-22460:


Never mind I just refreshed this issue and now see the link to 
https://github.com/apache/hbase/pull/600 

> Reopen a region if store reader references may have leaked
> --
>
> Key: HBASE-22460
> URL: https://issues.apache.org/jira/browse/HBASE-22460
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: Andrew Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.3.0
>
> Attachments: HBASE-22460.master.000.patch
>
>
> We can leak store reader references if a coprocessor or core function somehow 
> opens a scanner, or wraps one, and then does not take care to call close on 
> the scanner or the wrapped instance. A reasonable mitigation for a reader 
> reference leak would be a fast reopen of the region on the same server 
> (initiated by the RS) This will release all resources, like the refcount, 
> leases, etc. The clients should gracefully ride over this like any other 
> region transition. This reopen would be like what is done during schema 
> change application and ideally would reuse the relevant code. If the refcount 
> is over some ridiculous threshold this mitigation could be triggered along 
> with a fat WARN in the logs. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-23000) Fix all consistently failing tests in branch-1.3

2019-09-10 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926808#comment-16926808
 ] 

Andrew Purtell commented on HBASE-23000:


Ok, well HBASE-22627 is a critical bug fix and cannot be reverted, so we shall 
need to fix what it broke. I had different test results when porting 
HBASE-22627. Not doubting the results here, but wondering if more is going on. 

> Fix all consistently failing tests in branch-1.3
> 
>
> Key: HBASE-23000
> URL: https://issues.apache.org/jira/browse/HBASE-23000
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.3.6
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>Priority: Major
>
> Flaky test report: 
> https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-1.3/Flaky_20Test_20Report/dashboard.html#job_2
> In last 30 builds this test failed all 30 times.
> Here is the stack trace: 
> {noformat}
> Stacktrace
> java.io.IOException: Shutting down
>   at 
> org.apache.hadoop.hbase.fs.TestBlockReorder.testBlockLocation(TestBlockReorder.java:428)
> Caused by: java.lang.RuntimeException: Master not initialized after 20ms 
> seconds
>   at 
> org.apache.hadoop.hbase.fs.TestBlockReorder.testBlockLocation(TestBlockReorder.java:428)
> {noformat}
> Link to latest jenkins build: 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-1.3/9351/testReport/org.apache.hadoop.hbase.fs/TestBlockReorder/testBlockLocation/



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22627) Port HBASE-22617 (Recovered WAL directories not getting cleaned up) to branch-1

2019-09-10 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22627:
---
Priority: Blocker  (was: Major)

> Port HBASE-22617 (Recovered WAL directories not getting cleaned up) to 
> branch-1
> ---
>
> Key: HBASE-22627
> URL: https://issues.apache.org/jira/browse/HBASE-22627
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 1.5.0, 1.4.10, 1.3.5
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Blocker
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-22627-branch-1.patch, HBASE-22627-branch-1.patch, 
> HBASE-22627-branch-1.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work started] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-22988 started by Andrew Purtell.
--
> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22988:
---
Attachment: HBASE-22988-branch-1.patch

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926274#comment-16926274
 ] 

Andrew Purtell commented on HBASE-22988:


Updated patch. Just a nit fix in {{bin/hbase}}

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22988:
---
Attachment: (was: HBASE-22988-branch-1.patch)

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926250#comment-16926250
 ] 

Andrew Purtell edited comment on HBASE-22988 at 9/10/19 2:21 AM:
-

Dropping a WIP patch.

This can't be committed or go through precommit until the Yetus issue is 
addressed. Please don't set to Patch Available status.

Manual tests check out. Every mode seems to work as expected. All unit tests 
pass locally. On that note, the mockito version used by branch-1 does not 
support matchers implemented as Java 8 lambdas, so the affected lines of some unit 
tests are commented out. Bumping mockito is risky given the chance of collateral 
damage; we probably should rewrite the affected tests. 


was (Author: apurtell):
Dropping a WIP patch.

This can't be committed or go through precommit until the Yetus issue is 
addressed. Please don't set to Patch Available status.

Manual tests check out. Every mode seems to work as expected. All unit tests 
pass locally. On that note, one remaining thing to do: some parts of some 
unit tests are currently commented out because the mockito and/or hamcrest 
version used by branch-1 does not support matchers implemented as Java 8 
lambdas. Bumping mockito is risky given chance of collateral damage, probably 
should rewrite the affected tests. 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926264#comment-16926264
 ] 

Andrew Purtell commented on HBASE-22988:


A significant difference from trunk is there is no filtered reads metric 
available in ClusterStatus, so I dropped that field, which affected some layout 
tests, and those were updated too. 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22988:
---
Fix Version/s: 1.5.0

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22988:
---
Attachment: HBASE-22988-branch-1.patch

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22988:
---
Attachment: (was: HBASE-22988-branch-1.patch)

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926250#comment-16926250
 ] 

Andrew Purtell commented on HBASE-22988:


Dropping a WIP patch.

This can't be committed or go through precommit until the Yetus issue is 
addressed. Please don't set to Patch Available status.

Manual tests check out. Every mode seems to work as expected. All unit tests 
pass locally. On that note, one remaining thing to do: some parts of some 
unit tests are currently commented out because the mockito and/or hamcrest 
version used by branch-1 does not support matchers implemented as Java 8 
lambdas. Bumping mockito is risky given chance of collateral damage, probably 
should rewrite the affected tests. 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22988:
---
Attachment: HBASE-22988-branch-1.patch

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
> Attachments: HBASE-22988-branch-1.patch
>
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (HBASE-22902) At regionserver start there's a request to roll the WAL

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reassigned HBASE-22902:
--

Assignee: (was: Andrew Purtell)

> At regionserver start there's a request to roll the WAL
> ---
>
> Key: HBASE-22902
> URL: https://issues.apache.org/jira/browse/HBASE-22902
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: David Manning
>Priority: Minor
>
> See HBASE-22301 for logic that requests to roll the WAL if regionserver 
> encounters a slow write pipeline. In the logs, during regionserver start, I 
> see that the WAL is requested to roll once. It's strange that we roll the WAL 
> because it wasn't a slow sync. It appears when this code executes, we haven't 
> initialized the {{rollOnSyncNs}} variable to use for determining whether it's 
> a slow sync. Current pipeline also shows empty in the logs.
> Disclaimer: I'm experiencing this after backporting this to 1.3.x and 
> building it there - I haven't attempted in 1.5.x, though I'd expect similar 
> results.
> Regionserver logs follow (notice *threshold=0 ms, current pipeline: []*):
> {noformat}
> Tue Aug 20 23:29:50 GMT 2019 Starting regionserver
> ...
> 2019-08-20 23:29:57,824 INFO  wal.FSHLog - WAL configuration: blocksize=256 
> MB, rollsize=243.20 MB, prefix=[truncated]%2C1566343792434, suffix=, 
> logDir=hdfs://[truncated]/hbase/WALs/[truncated],1566343792434, 
> archiveDir=hdfs://[truncated]/hbase/oldWALs
> 2019-08-20 23:29:58,104 INFO  wal.FSHLog - Slow sync cost: 186 ms, current 
> pipeline: []
> 2019-08-20 23:29:58,104 WARN  wal.FSHLog - Requesting log roll because we 
> exceeded slow sync threshold; time=186 ms, threshold=0 ms, current pipeline: 
> []
> 2019-08-20 23:29:58,107 DEBUG regionserver.ReplicationSourceManager - Start 
> tracking logs for wal group [truncated]%2C1566343792434 for peer 1
> 2019-08-20 23:29:58,107 INFO  wal.FSHLog - New WAL 
> /hbase/WALs/[truncated],1566343792434/[truncated]%2C1566343792434.1566343797824
> 2019-08-20 23:29:58,109 DEBUG regionserver.ReplicationSource - Starting up 
> worker for wal group [truncated]%2C1566343792434{noformat}
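A minimal sketch of one way to avoid the spurious roll described above, assuming the slow-sync check is simply skipped until the threshold has been initialized (field and method names here are invented, not the actual FSHLog code):

{code:java}
// Hypothetical guard: no roll request until a real slow-sync threshold is set.
public class SlowSyncRollGuard {
  private volatile long rollOnSyncNs = 0L;  // 0 means "not yet initialized"

  void init(long thresholdMs) {
    this.rollOnSyncNs = thresholdMs * 1_000_000L;
  }

  /** Returns true only when a real threshold is configured and exceeded. */
  boolean shouldRequestRoll(long syncDurationNs) {
    long threshold = rollOnSyncNs;
    if (threshold <= 0) {
      return false;  // avoids the "threshold=0 ms" roll request seen at startup
    }
    return syncDurationNs > threshold;
  }
}
{code}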



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22902) At regionserver start there's a request to roll the WAL

2019-09-09 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925931#comment-16925931
 ] 

Andrew Purtell commented on HBASE-22902:


Going out on vacation for much of September. Unassigning, in case someone else 
is interested in picking it up in the meantime. I'll assign back to myself and 
do it upon return otherwise. 

> At regionserver start there's a request to roll the WAL
> ---
>
> Key: HBASE-22902
> URL: https://issues.apache.org/jira/browse/HBASE-22902
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: David Manning
>Priority: Minor
>
> See HBASE-22301 for logic that requests to roll the WAL if regionserver 
> encounters a slow write pipeline. In the logs, during regionserver start, I 
> see that the WAL is requested to roll once. It's strange that we roll the WAL 
> because it wasn't a slow sync. It appears when this code executes, we haven't 
> initialized the {{rollOnSyncNs}} variable to use for determining whether it's 
> a slow sync. Current pipeline also shows empty in the logs.
> Disclaimer: I'm experiencing this after backporting this to 1.3.x and 
> building it there - I haven't attempted in 1.5.x, though I'd expect similar 
> results.
> Regionserver logs follow (notice *threshold=0 ms, current pipeline: []*):
> {noformat}
> Tue Aug 20 23:29:50 GMT 2019 Starting regionserver
> ...
> 2019-08-20 23:29:57,824 INFO  wal.FSHLog - WAL configuration: blocksize=256 
> MB, rollsize=243.20 MB, prefix=[truncated]%2C1566343792434, suffix=, 
> logDir=hdfs://[truncated]/hbase/WALs/[truncated],1566343792434, 
> archiveDir=hdfs://[truncated]/hbase/oldWALs
> 2019-08-20 23:29:58,104 INFO  wal.FSHLog - Slow sync cost: 186 ms, current 
> pipeline: []
> 2019-08-20 23:29:58,104 WARN  wal.FSHLog - Requesting log roll because we 
> exceeded slow sync threshold; time=186 ms, threshold=0 ms, current pipeline: 
> []
> 2019-08-20 23:29:58,107 DEBUG regionserver.ReplicationSourceManager - Start 
> tracking logs for wal group [truncated]%2C1566343792434 for peer 1
> 2019-08-20 23:29:58,107 INFO  wal.FSHLog - New WAL 
> /hbase/WALs/[truncated],1566343792434/[truncated]%2C1566343792434.1566343797824
> 2019-08-20 23:29:58,109 DEBUG regionserver.ReplicationSource - Starting up 
> worker for wal group [truncated]%2C1566343792434{noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-09 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925929#comment-16925929
 ] 

Andrew Purtell commented on HBASE-22978:


Going out on vacation for much of September. Unassigning, in case someone else 
is interested in picking it up in the meantime. I'll assign back to myself and 
do it upon return otherwise. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> 

[jira] [Assigned] (HBASE-22978) Online slow response log

2019-09-09 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reassigned HBASE-22978:
--

Assignee: (was: Andrew Purtell)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 
> a user name, will include only entries pertaining to that user. Filters are 
> additive, for example if 

[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-07 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925030#comment-16925030
 ] 

Andrew Purtell commented on HBASE-22988:


We won't be able to commit this or even pass it through precommit until Yetus 
or our project config can handle a branch-1 Java 7 build with optional modules 
for Java 8. It's what broke precommit before and stalled the backport of 
TinyLFU. /cc [~busbey] 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-07 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925031#comment-16925031
 ] 

Andrew Purtell commented on HBASE-22988:


That said I will post the patch here anyway. 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-07 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925028#comment-16925028
 ] 

Andrew Purtell commented on HBASE-22988:


I have a patch for this. It is against a local version that also includes 
TinyLFU. Let me clean it up for branch-1 and post it Monday. 

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (HBASE-22988) Backport HBASE-11062 "hbtop" to branch-1

2019-09-07 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reassigned HBASE-22988:
--

Assignee: Andrew Purtell  (was: Toshihiro Suzuki)

> Backport HBASE-11062 "hbtop" to branch-1
> 
>
> Key: HBASE-22988
> URL: https://issues.apache.org/jira/browse/HBASE-22988
> Project: HBase
>  Issue Type: Sub-task
>  Components: backport, hbtop
>Reporter: Toshihiro Suzuki
>Assignee: Andrew Purtell
>Priority: Major
>
> Backport parent issue to branch-1.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923852#comment-16923852
 ] 

Andrew Purtell edited comment on HBASE-22978 at 9/6/19 1:19 AM:


You can invoke the shell with a command and pipe the output to a file, like 
{{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped 
to something else. Nothing special need be done there. 

I like the idea of persisting the complete slow log with best effort, though. 
Reminds me of the [MySQL slow log dump 
tool|https://dev.mysql.com/doc/refman/8.0/en/mysqldumpslow.html]. Initially I 
was thinking this could be part of general performance surveillance, where 
sampling is good enough, but maybe there could be a tough to debug case that's 
rare so would also be hard to catch in the ring buffers. For that we'd 
configure a slow log directory in site configuration, presumably in HDFS, into 
which regionservers would each append to a file, rolling at some configured 
bound. A tool that decodes and prints to stdout, like HFilePrettyPrinter and 
such, can mostly share common code with what we put in the regionserver to do 
the same out to RPC for the shell. 


was (Author: apurtell):
You can invoke the shell with a command and pipe the output to a file, like 
{{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped 
to something else. Nothing special need be done there. 

I like the idea of persisting the complete slow log with best effort, though. 
Reminds me of the [MySQL slow 
log|https://dev.mysql.com/doc/refman/8.0/en/slow-query-log.html]. Initially I 
was thinking this could be part of general performance surveillance, where 
sampling is good enough, but maybe there could be a tough to debug case that's 
rare so would also be hard to catch in the ring buffers. For that we'd 
configure a slow log directory in site configuration, presumably in HDFS, into 
which regionservers would each append to a file, rolling at some configured 
bound. A tool that decodes and prints to stdout, like HFilePrettyPrinter and 
such, can mostly share common code with what we put in the regionserver to do 
the same out to RPC for the shell. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total 

[jira] [Commented] (HBASE-22760) Stop/Resume Snapshot Auto-Cleanup activity with shell command

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923854#comment-16923854
 ] 

Andrew Purtell commented on HBASE-22760:


[~vjasani] I can't commit this yet. TestShell fails on branch-1
{noformat}
  1) Error:
test_snapshot_auto_cleanup_should_work(Hbase::AdminMethodsTest):
NoMethodError: undefined method `command' for 
#
./src/test/ruby/hbase/admin_test.rb:163:in 
`test_snapshot_auto_cleanup_should_work'
org/jruby/RubyProc.java:270:in `call'
org/jruby/RubyKernel.java:2105:in `send'
org/jruby/RubyArray.java:1620:in `each'
org/jruby/RubyArray.java:1620:in `each'

422 tests, 593 assertions, 0 failures, 1 errors
{noformat}


> Stop/Resume Snapshot Auto-Cleanup activity with shell command
> -
>
> Key: HBASE-22760
> URL: https://issues.apache.org/jira/browse/HBASE-22760
> Project: HBase
>  Issue Type: Improvement
>  Components: Admin, shell, snapshots
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0
>
> Attachments: HBASE-22760.branch-1.000.patch, 
> HBASE-22760.branch-2.000.patch, HBASE-22760.master.003.patch, 
> HBASE-22760.master.004.patch, HBASE-22760.master.005.patch, 
> HBASE-22760.master.008.patch, HBASE-22760.master.009.patch
>
>
> For any scheduled snapshot backup activity, we would like to disable 
> auto-cleaner for snapshot based on TTL. However, as per HBASE-22648 we have a 
> config to disable snapshot auto-cleaner: 
> hbase.master.cleaner.snapshot.disable, which would take effect only upon 
> HMaster restart just similar to any other hbase-site configs.
> For any running cluster, we should be able to stop/resume auto-cleanup 
> activity for snapshot based on shell command. Something similar to below 
> command should be able to stop/start cleanup chore:
> hbase(main):001:0> snapshot_auto_cleanup_switch false    (disable 
> auto-cleaner)
> hbase(main):001:0> snapshot_auto_cleanup_switch true     (enable auto-cleaner)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22760) Stop/Resume Snapshot Auto-Cleanup activity with shell command

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22760:
---
Fix Version/s: (was: 2.2.2)
   (was: 2.1.7)
   (was: 1.4.11)

> Stop/Resume Snapshot Auto-Cleanup activity with shell command
> -
>
> Key: HBASE-22760
> URL: https://issues.apache.org/jira/browse/HBASE-22760
> Project: HBase
>  Issue Type: Improvement
>  Components: Admin, shell, snapshots
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0
>
> Attachments: HBASE-22760.branch-1.000.patch, 
> HBASE-22760.branch-2.000.patch, HBASE-22760.master.003.patch, 
> HBASE-22760.master.004.patch, HBASE-22760.master.005.patch, 
> HBASE-22760.master.008.patch, HBASE-22760.master.009.patch
>
>
> For any scheduled snapshot backup activity, we would like to disable 
> auto-cleaner for snapshot based on TTL. However, as per HBASE-22648 we have a 
> config to disable snapshot auto-cleaner: 
> hbase.master.cleaner.snapshot.disable, which would take effect only upon 
> HMaster restart just similar to any other hbase-site configs.
> For any running cluster, we should be able to stop/resume auto-cleanup 
> activity for snapshot based on shell command. Something similar to below 
> command should be able to stop/start cleanup chore:
> hbase(main):001:0> snapshot_auto_cleanup_switch false    (disable 
> auto-cleaner)
> hbase(main):001:0> snapshot_auto_cleanup_switch true     (enable auto-cleaner)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923852#comment-16923852
 ] 

Andrew Purtell commented on HBASE-22978:


You can invoke the shell with a command and pipe the output to a file, like 
{{./bin/hbase shell 'command' > output.txt 2>&1}} . Or the output can be piped 
to something else. Nothing special need be done there. 

I like the idea of persisting the complete slow log with best effort, though. 
Reminds me of the [MySQL slow 
log|https://dev.mysql.com/doc/refman/8.0/en/slow-query-log.html]. Initially I 
was thinking this could be part of general performance surveillance, where 
sampling is good enough, but maybe there could be a tough to debug case that's 
rare so would also be hard to catch in the ring buffers. For that we'd 
configure a slow log directory in site configuration, presumably in HDFS, into 
which regionservers would each append to a file, rolling at some configured 
bound. A tool that decodes and prints to stdout, like HFilePrettyPrinter and 
such, can mostly share common code with what we put in the regionserver to do 
the same out to RPC for the shell. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923772#comment-16923772
 ] 

Andrew Purtell edited comment on HBASE-22978 at 9/5/19 10:37 PM:
-

Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

Pseudocode of this command would be like:
{noformat}
servers = nil
filters = ...
if filters['REGION']
  servers = locateRegion(filters['REGION'])
elsif filters['TABLE']
  servers = locateRegions(filters['TABLE'])
end
results = @admin.getSlowResponses(servers).map { |entry|
  # if a returned slow response entry matches any filter, return it, else nil
}.compact # remove any nils
# pretty print 'results'
{noformat}


was (Author: apurtell):
Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

Pseudocode of this command would be like:
{noformat}
servers = []
filters = ...
if filters['REGION']
  servers = locateRegion(filters['REGION'])
elsif filters['TABLE']
  servers = locateRegions(filters['TABLE'])
end
results = @admin.getSlowResponses(servers.empty? ? nil : servers).map { |entry|
  # if a returned slow response entry matches any filter, return it, else nil
}.compact # remove any nils
# pretty print 'results'
{noformat}

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry 

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.
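
For illustration only, a minimal generic sketch of the retention structure described above (the real implementation could just as well reuse an existing lock-free ring buffer; the record type, sizing, and synchronization strategy are all open questions):
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Fixed-capacity ring buffer; the oldest entry is overwritten once the buffer is full. */
public final class RingBuffer<T> {
  private final Object[] entries;
  private long nextIndex; // monotonically increasing write position

  public RingBuffer(int capacity) {
    this.entries = new Object[capacity];
  }

  public synchronized void add(T entry) {
    entries[(int) (nextIndex++ % entries.length)] = entry;
  }

  /** Snapshot of whatever is currently retained, oldest first. */
  @SuppressWarnings("unchecked")
  public synchronized List<T> snapshot() {
    List<T> out = new ArrayList<>();
    for (long i = Math.max(0, nextIndex - entries.length); i < nextIndex; i++) {
      out.add((T) entries[(int) (i % entries.length)]);
    }
    return out;
  }
}
{code}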

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.
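
A rough sketch of that compress-on-write, decompress-on-read step, using the JDK Deflater/Inflater purely as a stand-in for whichever codec (snappy, zstandard) is actually available at runtime:
{code:java}
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public final class EntryCodec {
  // Compress a serialized ring buffer entry before retaining it.
  public static byte[] compress(byte[] serialized) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(serialized);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream(serialized.length);
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  // Decompress on the fly when servicing a retrieval API request.
  public static byte[] decompress(byte[] compressed) throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 2);
    byte[] buf = new byte[4096];
    while (!inflater.finished()) {
      out.write(buf, 0, inflater.inflate(buf));
    }
    inflater.end();
    return out.toByteArray();
  }
}
{code}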

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this make a note of 
it in the log file. Provide a tool for parsing, dumping, filtering, and pretty 
printing the slow logs written to HDFS. The tool and the shell can share and 
reuse some utility classes and methods for accomplishing that. 
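
One way the non-blocking write-behind could be sketched is a bounded queue drained by a single background thread, dropping and counting entries when the HDFS writer falls behind; the file layout and serialization below are placeholders, not part of the proposal:
{code:java}
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class SlowLogPersister implements Runnable {
  private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
  private final AtomicLong dropped = new AtomicLong();
  private final FSDataOutputStream out;

  public SlowLogPersister(Configuration conf, Path file) throws IOException {
    this.out = FileSystem.get(conf).create(file);
  }

  /** Called on the RPC handler path; never blocks. */
  public void offer(byte[] serializedEntry) {
    if (!queue.offer(serializedEntry)) {
      dropped.incrementAndGet(); // best effort: note the loss and keep going
    }
  }

  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        byte[] entry = queue.take();
        out.writeInt(entry.length); // length-prefixed records; format is a placeholder
        out.write(entry);
        out.hflush();
      }
    } catch (InterruptedException | IOException e) {
      // best effort; stop persisting but leave the in-memory ring buffer unaffected
    }
  }
}
{code}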

—

New shell commands:

{{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. Filters are additive, for example 
if both CLIENT_IP and USER filters are given, entries matching either or both 
conditions will be included. The exception to this is if both TABLE and REGION 
filters are provided, REGION will be preferred and TABLE will be ignored. A 
server name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

{{clear_slow_responses [ <server1> ... , <serverN> ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A 

[jira] [Assigned] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reassigned HBASE-22978:
--

Assignee: Andrew Purtell

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. 
> —
> New shell commands:
> {{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 
> a user name, will include only entries pertaining to that user. Filters are 
> additive, for example if both CLIENT_IP and USER filters are given, entries 
> matching either or both conditions will be included. The exception to this is 
> if both TABLE and REGION filters are provided, REGION 

[jira] [Comment Edited] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923772#comment-16923772
 ] 

Andrew Purtell edited comment on HBASE-22978 at 9/5/19 10:25 PM:
-

Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

Pseudocode of this command would be like:
{noformat}
servers = []
filters = ...
if filters['REGION']
  servers = locateRegion( filters['REGION'] )
elsif filters['TABLE']
  servers = locateRegions( filters['TABLE'] )
end
results = @admin.getSlowResponses(servers.empty? ? nil : servers).map { |entry|
  # if a returned slow response entry matches any filter, return the entry, else nil
}.compact # remove any nils
# pretty print 'results'
{noformat}


was (Author: apurtell):
Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. 
> —
> New shell commands:
> {{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as 

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923772#comment-16923772
 ] 

Andrew Purtell commented on HBASE-22978:


Description updated. Admin API simplified. The {{get_slow_responses}} shell 
command is reconsidered as a general query tool. 

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this make 
> a note of it in the log file. 
> —
> New shell commands:
> {{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 
> a user name, will include only entries pertaining to that user. Filters are 
> additive, for example if both CLIENT_IP and USER filters are given, entries 
> matching either or both conditions will be 

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this make a note of 
it in the log file. 

—

New shell commands:

{{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. Filters are additive, for example 
if both CLIENT_IP and USER filters are given, entries matching either or both 
conditions will be included. The exception to this is if both TABLE and REGION 
filters are provided, REGION will be preferred and TABLE will be ignored. A 
server name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

{{clear_slow_responses [ <server1> ... , <serverN> ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List<ResponseDetail> Admin#getSlowResponses(@Nullable List<ServerName> servers);
{code}
{code:java}
List 

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this make a note of 
it in the log file. 

—

New shell commands:

{{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. If both TABLE and REGION filters 
are provided, REGION will be used. A server name is its host, port, and start 
code, e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [ <server1> ... , <serverN> ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List<ResponseDetail> Admin#getSlowResponses(@Nullable List<ServerName> servers);
{code}
{code:java}
List<Boolean> Admin#clearSlowResponses(@Nullable List<ServerName> servers);
{code}
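
For illustration, client-side use of the proposed API could resemble the following shape (ResponseDetail and the method names are proposal-level only, not a committed Admin signature):
{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.ServerName;

// Proposal-level interface shape only; field names in ResponseDetail are placeholders.
interface SlowResponseAdmin {
  List<ResponseDetail> getSlowResponses(List<ServerName> servers) throws IOException;
  List<Boolean> clearSlowResponses(List<ServerName> servers) throws IOException;
}

final class ResponseDetail {
  // e.g. region, client IP, user, queue/processing times, request parameters
}
{code}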

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines 

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this make a note of 
it in the log file. 

—

New shell commands:

{{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. A server name is its host, port, 
and start code, e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [ <server1> ... , <serverN> ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable 
List<ServerName> servers);
{code}
{code:java}
List<ResponseDetail> Admin#getSlowResponses(@Nullable List<ServerName> servers);
{code}
{code:java}
List<Boolean> Admin#clearSlowResponses(@Nullable List<ServerName> servers);
{code}

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging 

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filters> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. A server name is its host, port, 
and start code, e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [ <server1> ... , <serverN> ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable 
List<ServerName> servers);
{code}
{code:java}
List<ResponseDetail> Admin#getSlowResponses(@Nullable List<ServerName> servers);
{code}
{code:java}
List<Boolean> Admin#clearSlowResponses(@Nullable List<ServerName> servers);
{code}

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 

[jira] [Commented] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923760#comment-16923760
 ] 

Andrew Purtell commented on HBASE-22978:


I'll update the description to include write-behind of the ring buffer to a 
directory in HDFS, but it shouldn't block, so if we stall during writing some 
ring buffer entries may have been lost. If we can detect that we can log that 
it happened in the file. 

bq. Also saving the user and/or client IP of the request and being able to ask 
for requests by those would be extra nice

Ok, will include user and client IP in the request details set aside. 

As for query APIs, the admin API is sugar over fan out requests to 
regionservers for whatever is currently sitting in the ring buffers. Where we 
want to narrow the search by region or table we can get region locations and 
prune the regionserver set. Filtering or sorting on other attributes would be 
done locally in the client. I think it best to let the client index the list of 
ResponseDetail however it likes. 

The shell commands are one client of the admin APIs. This seems a good place to 
put additional convenience filtering. Will update the description for this too. 
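
A sketch of that pruning step, using the existing RegionLocator client API to map a table to the servers currently hosting its regions (the retrieval call itself would then fan out only to this set):
{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;

final class ServerPruning {
  // Narrow the fan-out set to the servers currently hosting the table's regions.
  static Set<ServerName> serversHosting(Connection conn, TableName table) throws IOException {
    Set<ServerName> servers = new HashSet<>();
    try (RegionLocator locator = conn.getRegionLocator(table)) {
      for (HRegionLocation loc : locator.getAllRegionLocations()) {
        if (loc != null && loc.getServerName() != null) {
          servers.add(loc.getServerName());
        }
      }
    }
    return servers;
  }
}
{code}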

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, regionserver, shell
>Reporter: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> —
> New shell commands:
> {{get_slow_responses <table or region name> [ , \{ SERVERS => <server list> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer. Provide a table name as first argument to find all regions and 
> retrieve too slow response entries for the given table from all servers 
> currently hosting it. Provide a region name as first argument to retrieve all 
> too slow response entries for the 

[jira] [Updated] (HBASE-22937) The RawBytesComparator in branch-1 have wrong comparison order

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22937:
---
Fix Version/s: 1.3.6

> The RawBytesComparator in branch-1 have wrong comparison order
> --
>
> Key: HBASE-22937
> URL: https://issues.apache.org/jira/browse/HBASE-22937
> Project: HBase
>  Issue Type: Bug
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
>
> When digging the HBASE-22862, we found a bug in 
> RawBytesComparator#compareOnlyKeyPortion  (although it's unrelated to the 
> corruption in HBASE-22862). 
> {code}
> @Override
> @VisibleForTesting
> public int compareOnlyKeyPortion(Cell left, Cell right) {
> // ...
>   return (0xff & left.getTypeByte()) - (0xff & right.getTypeByte());
> }
> {code}
> Here should be (0xff & right.getTypeByte()) - (0xff & left.getTypeByte())  I 
> think.
> I can see the BloomFilter or HFile v2 are still using the comparator in 
> branch-1 (but not in branch-2). Maybe we can just remove the class (if some 
> HFile encoded with this comparator, then mapping to the correct KVComparator 
> just like 2.x), or fix the bug in current RawBytesComparator.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses <table or region name> [ , \{ SERVERS => <server list> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer. Provide a table name as first argument to find all regions and retrieve 
too slow response entries for the given table from all servers currently 
hosting it. Provide a region name as first argument to retrieve all too slow 
response entries for the given region. Optionally provide a map of parameters 
as second argument. The SERVERS parameter, which expects a list of server 
names, will constrain the search to the given set of servers. A server name is 
its host, port, and start code, e.g. "host187.example.com,60020,1289493121758".

{{get_slow_responses [ <server1> ... , <serverN> ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers on the cluster 
if no argument is provided. A server name is its host, port, and start code, 
e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [ <server1> ... , <serverN> ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable 
List<ServerName> servers);
{code}
{code:java}
List<ResponseDetail> Admin#getSlowResponses(@Nullable List<ServerName> servers);
{code}
{code:java}
List<Boolean> Admin#clearSlowResponses(@Nullable List<ServerName> servers);
{code}

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan","param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. (All of these things have been reported from 
various production settings. Truncation was added after we crashed a user's log 
capture system.) The truncation is unfortunate because it eliminates much of 
the utility of the warnings. For example, the region name, the start and end 
keys, and the filter hierarchy are all important clues for debugging 
performance problems caused by moderate to low selectivity queries or queries 
made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses <table or region name> [ , \{ SERVERS => <server list> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer. Provide a table name as first argument to find all regions and retrieve 
too slow response entries for the given table from all servers currently 
hosting it. Provide a region name as first argument to retrieve all too slow 
response entries for the given region. Optionally provide a map of parameters 
as second argument. The SERVERS parameter, which expects a list of server 
names, will constrain the search to the given set of servers. A server name is 
its host, port, and start code, e.g. "host187.example.com,60020,1289493121758".

{{get_slow_responses [ <server1> ... , <serverN> ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers on the cluster 
if no argument is provided. A server name is its host, port, and start code, 
e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [ <server1> ... , <serverN> ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable 
List<ServerName> servers);
{code}
{code:java}
List<ResponseDetail> Admin#getSlowResponses(@Nullable List<ServerName> servers);
{code}
{code:java}
List<Boolean> Admin#clearSlowResponses(@Nullable List<ServerName> servers);
{code}

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}

2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] 

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}

2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan","param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}

{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. (All of these things have been reported from 
various production settings. Truncation was added after we crashed a user's log 
capture system.) The truncation is unfortunate because it eliminates much of 
the utility of the warnings. For example, the region name, the start and end 
keys, and the filter hierarchy are all important clues for debugging 
performance problems caused by moderate to low selectivity queries or queries 
made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.
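As a rough sketch of the compress-on-insert, decompress-on-read idea: java.util.zip is used below purely as a stand-in, since snappy or zstandard would only be used when codec support is available, and carrying the uncompressed length alongside the entry is an assumption of this sketch rather than part of the proposal.

{code:java}
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch only: compress a serialized record for retention, decompress on read. */
public final class RecordCompression {

  static byte[] compress(byte[] serializedRecord) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED); // favor speed over ratio
    deflater.setInput(serializedRecord);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  static byte[] decompress(byte[] compressed, int uncompressedLength)
      throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] out = new byte[uncompressedLength];
    int n = inflater.inflate(out);
    inflater.end();
    return Arrays.copyOf(out, n);
  }
}
{code}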

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses  [ , \{ SERVERS=> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer. Provide a table name as first argument to find all regions and retrieve 
too slow response entries for the given table from all servers currently 
hosting it. Provide a region name as first argument to retrieve all too slow 
response entries for the given region. Optionally provide a map of parameters 
as second argument. The SERVERS parameter, which expects a list of server 
names, will constrain the search to the given set of servers. A server name is 
its host, port, and start code, e.g. "host187.example.com,60020,1289493121758".

{{get_slow_responses [  ... ,  ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers on the cluster 
if no argument is provided. A server name is its host, port, and start code, 
e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(String tableOrRegion, @Nullable List servers);
{code}
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}

2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] 

[jira] [Created] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)
Andrew Purtell created HBASE-22978:
--

 Summary: Online slow response log
 Key: HBASE-22978
 URL: https://issues.apache.org/jira/browse/HBASE-22978
 Project: HBase
  Issue Type: New Feature
  Components: Admin, regionserver, shell
Reporter: Andrew Purtell


Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}

2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow): 
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1567203007549,"responsesize":6819737,"method":"Scan","param":"region
 { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000","processingtimems":28646,"client":"10.253.196.215:41116","queuetimems":22453,"class":"HRegionServer"}

{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. (All of these things have been reported from 
various production settings. Truncation was added after we crashed a user's log 
capture system.) The truncation is unfortunate because it eliminates much of 
the utility of the warnings. For example, the region name, the start and end 
keys, and the filter hierarchy are all important clues for debugging 
performance problems caused by moderate to low selectivity queries or queries 
made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses  [ , \{ SERVERS=> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer. Provide a table name as first argument to find all regions and retrieve 
too slow response entries for the given table from all servers currently 
hosting it. Provide a region name as first argument to retrieve all too slow 
response entries for the given region. Optionally provide a map of parameters 
as second argument. The SERVERS parameter, which expects a list of server 
names, will constrain the search to the given set of servers. A server name is 
its host, port, and start code, e.g. "host187.example.com,60020,1289493121758".

{{get_slow_responses [  ... ,  ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers on the cluster 
if no argument is provided. A server name is its host, port, and start code, 
e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(String tableOrRegion, @Nullable List servers);
{code}
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22760) Stop/Resume Snapshot Auto-Cleanup activity with shell command

2019-09-05 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923719#comment-16923719
 ] 

Andrew Purtell commented on HBASE-22760:


Thanks. Will try to commit this today. If those checkstyle warns in the 
branch-1 report are related I'll fix on commit. 

> Stop/Resume Snapshot Auto-Cleanup activity with shell command
> -
>
> Key: HBASE-22760
> URL: https://issues.apache.org/jira/browse/HBASE-22760
> Project: HBase
>  Issue Type: Improvement
>  Components: Admin, shell, snapshots
>Affects Versions: 3.0.0, 1.5.0, 2.3.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0, 1.4.11, 2.1.7, 2.2.2
>
> Attachments: HBASE-22760.branch-1.000.patch, 
> HBASE-22760.branch-2.000.patch, HBASE-22760.master.003.patch, 
> HBASE-22760.master.004.patch, HBASE-22760.master.005.patch, 
> HBASE-22760.master.008.patch, HBASE-22760.master.009.patch
>
>
> For any scheduled snapshot backup activity, we would like to disable 
> auto-cleaner for snapshot based on TTL. However, as per HBASE-22648 we have a 
> config to disable snapshot auto-cleaner: 
> hbase.master.cleaner.snapshot.disable, which would take effect only upon 
> HMaster restart just similar to any other hbase-site configs.
> For any running cluster, we should be able to stop/resume auto-cleanup 
> activity for snapshot based on shell command. Something similar to below 
> command should be able to stop/start cleanup chore:
> hbase(main):001:0> snapshot_auto_cleanup_switch false    (disable 
> auto-cleaner)
> hbase(main):001:0> snapshot_auto_cleanup_switch true     (enable auto-cleaner)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22760) Stop/Resume Snapshot Auto-Cleanup activity with shell command

2019-08-28 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918004#comment-16918004
 ] 

Andrew Purtell commented on HBASE-22760:


+1

> Stop/Resume Snapshot Auto-Cleanup activity with shell command
> -
>
> Key: HBASE-22760
> URL: https://issues.apache.org/jira/browse/HBASE-22760
> Project: HBase
>  Issue Type: Improvement
>  Components: Admin, shell, snapshots
>Affects Versions: 3.0.0, 1.5.0, 2.3.0, 2.2.1, 1.4.11
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0, 2.2.1, 1.4.11
>
> Attachments: HBASE-22760.master.003.patch, 
> HBASE-22760.master.004.patch, HBASE-22760.master.005.patch, 
> HBASE-22760.master.008.patch, HBASE-22760.master.009.patch
>
>
> For any scheduled snapshot backup activity, we would like to disable 
> auto-cleaner for snapshot based on TTL. However, as per HBASE-22648 we have a 
> config to disable snapshot auto-cleaner: 
> hbase.master.cleaner.snapshot.disable, which would take effect only upon 
> HMaster restart just similar to any other hbase-site configs.
> For any running cluster, we should be able to stop/resume auto-cleanup 
> activity for snapshot based on shell command. Something similar to below 
> command should be able to stop/start cleanup chore:
> hbase(main):001:0> snapshot_auto_cleanup_switch false    (disable 
> auto-cleaner)
> hbase(main):001:0> snapshot_auto_cleanup_switch true     (enable auto-cleaner)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22380) break circle replication when doing bulkload

2019-08-28 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918003#comment-16918003
 ] 

Andrew Purtell commented on HBASE-22380:


A few days is fine. It's not looking good this week or early next either, a few 
other things, and the labor day holiday in the US. 

> break circle replication when doing bulkload
> 
>
> Key: HBASE-22380
> URL: https://issues.apache.org/jira/browse/HBASE-22380
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.0.5, 2.3.0, 2.1.5, 1.3.5
>Reporter: chenxu
>Assignee: Wellington Chevreuil
>Priority: Critical
>  Labels: bulkload
> Fix For: 3.0.0, 1.5.0, 2.3.0, 1.4.11, 2.1.7, 2.2.2
>
>
> when enabled master-master bulkload replication, HFiles will be replicated 
> circularly between two clusters



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22935) TaskMonitor warns MonitoredRPCHandler task may be stuck when it recently started

2019-08-28 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22935:
---
Fix Version/s: 1.4.11
   1.3.6
   2.1.6
   2.2.1
   2.3.0
   1.5.0
   3.0.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> TaskMonitor warns MonitoredRPCHandler task may be stuck when it recently 
> started
> 
>
> Key: HBASE-22935
> URL: https://issues.apache.org/jira/browse/HBASE-22935
> Project: HBase
>  Issue Type: Bug
>  Components: logging
>Affects Versions: 3.0.0, 1.4.0, 1.5.0, 1.3.3, 2.0.0
>Reporter: David Manning
>Assignee: David Manning
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.3.0, 2.2.1, 2.1.6, 1.3.6, 1.4.11
>
> Attachments: HBASE-22935.master.001.patch
>
>
> After setting {{hbase.taskmonitor.rpc.warn.time}} to 18, the logs show 
> WARN messages such as these
> {noformat}
> 2019-08-08 21:50:02,601 WARN  [read for TaskMonitor] monitoring.TaskMonitor - 
> Task may be stuck: RpcServer.FifoWFPBQ.default.handler=4,queue=4,port=60020: 
> status=Servicing call from :55164: Scan, state=RUNNING, 
> startTime=1563305858103, completionTime=-1, queuetimems=1565301002599, 
> starttimems=1565301002599, clientaddress=, remoteport=55164, 
> packetlength=370, rpcMethod=Scan
> {noformat}
> Notice that the first {{starttimems}} is far in the past. The second 
> {{starttimems}} and the {{queuetimems}} are much closer to the log timestamp 
> than 180 seconds. I think this is because the warnTime is initialized to the 
> time that MonitoredTaskImpl is created, but never updated until we write a 
> warn message to the log.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (HBASE-22862) Region Server crash with: Added a key not lexically larger than previous

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917379#comment-16917379
 ] 

Andrew Purtell edited comment on HBASE-22862 at 8/28/19 3:52 AM:
-

To clarify. The method you highlight has a visible for testing annotation. 
Whether or not the class is used is different from whether or not this method 
is used for more than just tests. Or if there are more issues than just this 
method. 


was (Author: apurtell):
To clarify. The method you highlight has a visible for testing annotation. 
Whether or not the class is used is different from whether or not this method 
is used for more than just tests. 

> Region Server crash with: Added a key not lexically larger than previous
> 
>
> Key: HBASE-22862
> URL: https://issues.apache.org/jira/browse/HBASE-22862
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 1.4.10
> Environment: {code}
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (Zulu 8.31.0.1-linux64) (build 1.8.0_181-b02)
> OpenJDK 64-Bit Server VM (Zulu 8.31.0.1-linux64) (build 25.181-b02, mixed 
> mode)
> {code}
>Reporter: Alex Batyrshin
>Assignee: Zheng Hu
>Priority: Critical
> Attachments: HBASE-22862.UT.v01.patch, HBASE-22862.UT.v02.patch
>
>
> We observe error "Added a key not lexically larger than previous” that cause 
> most of our region-servers to crash in our cluster.
> {code}
> 2019-08-15 18:02:10,554 INFO  [MemStoreFlusher.0] regionserver.HRegion: 
> Flushing 1/1 column families, memstore=56.08 MB
> 2019-08-15 18:02:10,727 WARN  [MemStoreFlusher.0] regionserver.HStore: Failed 
> flushing store file, retrying num=0
> java.io.IOException: Added a key not lexically larger than previous. Current 
> cell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567,
>  lastCell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770
>at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139)
>at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003)
>at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
>at java.lang.Thread.run(Thread.java:748)
> 2019-08-15 18:02:21,776 WARN  [MemStoreFlusher.0] regionserver.HStore: Failed 
> flushing store file, retrying num=9
> java.io.IOException: Added a key not lexically larger than previous. Current 
> cell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567,
>  lastCell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770
>at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139)
>at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>at 
> 

[jira] [Commented] (HBASE-22862) Region Server crash with: Added a key not lexically larger than previous

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917379#comment-16917379
 ] 

Andrew Purtell commented on HBASE-22862:


To clarify. The method you highlight has a visible for testing annotation. 
Whether or not the class is used is different from whether or not this method 
is used for more than just tests. 

> Region Server crash with: Added a key not lexically larger than previous
> 
>
> Key: HBASE-22862
> URL: https://issues.apache.org/jira/browse/HBASE-22862
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 1.4.10
> Environment: {code}
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (Zulu 8.31.0.1-linux64) (build 1.8.0_181-b02)
> OpenJDK 64-Bit Server VM (Zulu 8.31.0.1-linux64) (build 25.181-b02, mixed 
> mode)
> {code}
>Reporter: Alex Batyrshin
>Assignee: Zheng Hu
>Priority: Critical
> Attachments: HBASE-22862.UT.v01.patch, HBASE-22862.UT.v02.patch
>
>
> We observe error "Added a key not lexically larger than previous” that cause 
> most of our region-servers to crash in our cluster.
> {code}
> 2019-08-15 18:02:10,554 INFO  [MemStoreFlusher.0] regionserver.HRegion: 
> Flushing 1/1 column families, memstore=56.08 MB
> 2019-08-15 18:02:10,727 WARN  [MemStoreFlusher.0] regionserver.HStore: Failed 
> flushing store file, retrying num=0
> java.io.IOException: Added a key not lexically larger than previous. Current 
> cell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567,
>  lastCell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770
>at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139)
>at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003)
>at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
>at java.lang.Thread.run(Thread.java:748)
> 2019-08-15 18:02:21,776 WARN  [MemStoreFlusher.0] regionserver.HStore: Failed 
> flushing store file, retrying num=9
> java.io.IOException: Added a key not lexically larger than previous. Current 
> cell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567,
>  lastCell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770
>at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139)
>at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003)
>at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622)
>at 
> 

[jira] [Commented] (HBASE-11062) hbtop

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-11062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917316#comment-16917316
 ] 

Andrew Purtell commented on HBASE-11062:


Fixed the issue with the branch-1 backport; formatting is better. It was my error 
somewhere. 

I see a div by zero issue when using ClusterStatus instead of ClusterMetrics, 
maybe this minor change should be considered:
{code}
diff --git a/hbase-hbtop/src/main/java/org/apache/hadoop/hbase/hbtop/mode/RequestCountPerSecond.java b/hbase-hbtop/src/main/java/org/apache/hadoop/hbase/hbtop/mode/RequestCountPerSecond.java
index 27625b9a33..5fd3453cf9 100644
--- a/hbase-hbtop/src/main/java/org/apache/hadoop/hbase/hbtop/mode/RequestCountPerSecond.java
+++ b/hbase-hbtop/src/main/java/org/apache/hadoop/hbase/hbtop/mode/RequestCountPerSecond.java
@@ -37,10 +37,12 @@ public class RequestCountPerSecond {
       previousReadRequestCount = readRequestCount;
       previousWriteRequestCount = writeRequestCount;
     } else if (previousLastReportTimestamp != lastReportTimestamp) {
-      readRequestCountPerSecond = (readRequestCount - previousReadRequestCount) /
-        ((lastReportTimestamp - previousLastReportTimestamp) / 1000);
-      writeRequestCountPerSecond = (writeRequestCount - previousWriteRequestCount) /
-        ((lastReportTimestamp - previousLastReportTimestamp) / 1000);
+      long delta = (lastReportTimestamp - previousLastReportTimestamp) / 1000;
+      if (delta < 1) {
+        delta = 1;
+      }
+      readRequestCountPerSecond = (readRequestCount - previousReadRequestCount) / delta;
+      writeRequestCountPerSecond = (writeRequestCount - previousWriteRequestCount) / delta;
 
       previousLastReportTimestamp = lastReportTimestamp;
       previousReadRequestCount = readRequestCount;
{code}

> hbtop
> -
>
> Key: HBASE-11062
> URL: https://issues.apache.org/jira/browse/HBASE-11062
> Project: HBase
>  Issue Type: New Feature
>  Components: hbtop
>Reporter: Andrew Purtell
>Assignee: Toshihiro Suzuki
>Priority: Major
>
> A top-like monitor could be useful for testing, debugging, operations of 
> clusters of moderate size, and possibly for diagnosing issues in large 
> clusters.
> Consider a curses interface like the one presented by atop 
> (http://www.atoptool.nl/images/screenshots/genericw.png) - with aggregate 
> metrics collected over a monitoring interval in the upper portion of the 
> pane, and a listing of discrete measurements sorted and filtered by various 
> criteria in the bottom part of the pane. One might imagine a cluster overview 
> with cluster aggregate metrics above and a list of regionservers sorted by 
> utilization below; and a regionserver view with process metrics above and a 
> list of metrics by operation type below, or a list of client connections, or 
> a list of threads, sorted by utilization, throughput, or latency. 
> Generically 'htop' is taken but would be distinctive in the HBase context, a 
> utility org.apache.hadoop.hbase.HTop
> No need necessarily for a curses interface. Could be an external monitor with 
> a web front end as has been discussed before. I do like the idea of a process 
> that runs in a terminal because I interact with dev and test HBase clusters 
> exclusively by SSH. 
> UPDATE:
> The tool name is changed from htop to hbtop.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22935) TaskMonitor warns MonitoredRPCHandler task may be stuck when it recently started

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917303#comment-16917303
 ] 

Andrew Purtell commented on HBASE-22935:


+1

> TaskMonitor warns MonitoredRPCHandler task may be stuck when it recently 
> started
> 
>
> Key: HBASE-22935
> URL: https://issues.apache.org/jira/browse/HBASE-22935
> Project: HBase
>  Issue Type: Bug
>  Components: logging
>Affects Versions: 3.0.0, 1.4.0, 1.5.0, 1.3.3, 2.0.0
>Reporter: David Manning
>Assignee: David Manning
>Priority: Minor
> Attachments: HBASE-22935.master.001.patch
>
>
> After setting {{hbase.taskmonitor.rpc.warn.time}} to 18, the logs show 
> WARN messages such as these
> {noformat}
> 2019-08-08 21:50:02,601 WARN  [read for TaskMonitor] monitoring.TaskMonitor - 
> Task may be stuck: RpcServer.FifoWFPBQ.default.handler=4,queue=4,port=60020: 
> status=Servicing call from :55164: Scan, state=RUNNING, 
> startTime=1563305858103, completionTime=-1, queuetimems=1565301002599, 
> starttimems=1565301002599, clientaddress=, remoteport=55164, 
> packetlength=370, rpcMethod=Scan
> {noformat}
> Notice that the first {{starttimems}} is far in the past. The second 
> {{starttimems}} and the {{queuetimems}} are much closer to the log timestamp 
> than 180 seconds. I think this is because the warnTime is initialized to the 
> time that MonitoredTaskImpl is created, but never updated until we write a 
> warn message to the log.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-11062) hbtop

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-11062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917251#comment-16917251
 ] 

Andrew Purtell commented on HBASE-11062:


I'm having trouble with the backport when using the Mac terminal app. Unit 
tests pass, but hbtop aborts. The backport isn't the issue, so please ignore that; 
I commented on the PR about the concerns around supporting a custom terminal 
library. 

> hbtop
> -
>
> Key: HBASE-11062
> URL: https://issues.apache.org/jira/browse/HBASE-11062
> Project: HBase
>  Issue Type: New Feature
>  Components: hbtop
>Reporter: Andrew Purtell
>Assignee: Toshihiro Suzuki
>Priority: Major
>
> A top-like monitor could be useful for testing, debugging, operations of 
> clusters of moderate size, and possibly for diagnosing issues in large 
> clusters.
> Consider a curses interface like the one presented by atop 
> (http://www.atoptool.nl/images/screenshots/genericw.png) - with aggregate 
> metrics collected over a monitoring interval in the upper portion of the 
> pane, and a listing of discrete measurements sorted and filtered by various 
> criteria in the bottom part of the pane. One might imagine a cluster overview 
> with cluster aggregate metrics above and a list of regionservers sorted by 
> utilization below; and a regionserver view with process metrics above and a 
> list of metrics by operation type below, or a list of client connections, or 
> a list of threads, sorted by utilization, throughput, or latency. 
> Generically 'htop' is taken but would be distinctive in the HBase context, a 
> utility org.apache.hadoop.hbase.HTop
> No need necessarily for a curses interface. Could be an external monitor with 
> a web front end as has been discussed before. I do like the idea of a process 
> that runs in a terminal because I interact with dev and test HBase clusters 
> exclusively by SSH. 
> UPDATE:
> The tool name is changed from htop to hbtop.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22806) Deleted CF are not cleared if memstore contain entries

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917196#comment-16917196
 ] 

Andrew Purtell commented on HBASE-22806:


bq. To fix this problem in branch-1, we need to add region reopen completion 
wait logic or disable-enable the table.

Neither option sounds great. Adding the wait-for-completion seems less 
disruptive. We can try that. [~pankaj2461]

> Deleted CF are not cleared if memstore contain entries
> --
>
> Key: HBASE-22806
> URL: https://issues.apache.org/jira/browse/HBASE-22806
> Project: HBase
>  Issue Type: Bug
>  Components: API
>Affects Versions: 2.1.3
> Environment: Scala
> HBase Java Client
> Mac/Linux
>Reporter: Chao
>Assignee: Pankaj Kumar
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 2.2.1, 2.1.6
>
> Attachments: HBASE-22806_UT.patch
>
>
> While deleting the CF dynamically (without disabling the table), CF dirs are 
> not cleared from FS when region memstore contain entries for that CF. 
> Since we delete the CF from FS first and then reopen the region, during 
> reopen RS will flush the memstore content to FS. So deleted CF store will 
> contain the memstore content for the deleted CF. 
> So adding back same CF will have old entries.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (HBASE-22909) align hbase-vote script across all current releases

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-22909.

  Assignee: (was: Andrew Purtell)
Resolution: Not A Problem

> align hbase-vote script across all current releases
> ---
>
> Key: HBASE-22909
> URL: https://issues.apache.org/jira/browse/HBASE-22909
> Project: HBase
>  Issue Type: Task
>  Components: build, community
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.3.0, 2.1.5, 1.3.5
>Reporter: Artem Ervits
>Priority: Minor
>
> hbase-vote script is in different state across all of the current releases. 
> Now that https://issues.apache.org/jira/browse/HBASE-22464 is merged, this 
> Jira is to converge all releases on one version of hbase-vote.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22909) align hbase-vote script across all current releases

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22909:
---
Fix Version/s: (was: 1.4.11)
   (was: 1.3.6)
   (was: 2.1.6)
   (was: 2.2.1)
   (was: 2.3.0)
   (was: 1.5.0)

> align hbase-vote script across all current releases
> ---
>
> Key: HBASE-22909
> URL: https://issues.apache.org/jira/browse/HBASE-22909
> Project: HBase
>  Issue Type: Task
>  Components: build, community
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.3.0, 2.1.5, 1.3.5
>Reporter: Artem Ervits
>Assignee: Andrew Purtell
>Priority: Minor
>
> hbase-vote script is in different state across all of the current releases. 
> Now that https://issues.apache.org/jira/browse/HBASE-22464 is merged, this 
> Jira is to converge all releases on one version of hbase-vote.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22909) align hbase-vote script across all current releases

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917193#comment-16917193
 ] 

Andrew Purtell commented on HBASE-22909:


There is no work to do here. git says the alignment has already taken place. 
Resolving as Not A Problem. 

> align hbase-vote script across all current releases
> ---
>
> Key: HBASE-22909
> URL: https://issues.apache.org/jira/browse/HBASE-22909
> Project: HBase
>  Issue Type: Task
>  Components: build, community
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.3.0, 2.1.5, 1.3.5
>Reporter: Artem Ervits
>Assignee: Andrew Purtell
>Priority: Minor
> Fix For: 1.5.0, 2.3.0, 2.2.1, 2.1.6, 1.3.6, 1.4.11
>
>
> hbase-vote script is in different state across all of the current releases. 
> Now that https://issues.apache.org/jira/browse/HBASE-22464 is merged, this 
> Jira is to converge all releases on one version of hbase-vote.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22909) align hbase-vote script across all current releases

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22909:
---
Fix Version/s: 1.4.11
   1.3.6
   2.1.6
   2.2.1
   2.3.0
   1.5.0

> align hbase-vote script across all current releases
> ---
>
> Key: HBASE-22909
> URL: https://issues.apache.org/jira/browse/HBASE-22909
> Project: HBase
>  Issue Type: Task
>  Components: build, community
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.3.0, 2.1.5, 1.3.5
>Reporter: Artem Ervits
>Assignee: Andrew Purtell
>Priority: Minor
> Fix For: 1.5.0, 2.3.0, 2.2.1, 2.1.6, 1.3.6, 1.4.11
>
>
> hbase-vote script is in different state across all of the current releases. 
> Now that https://issues.apache.org/jira/browse/HBASE-22464 is merged, this 
> Jira is to converge all releases on one version of hbase-vote.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22909) align hbase-vote script across all current releases

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917190#comment-16917190
 ] 

Andrew Purtell commented on HBASE-22909:


I'll pick the master version back to all branches and close this. Going forward 
yeah we should document that the master version will receive updates first. 

> align hbase-vote script across all current releases
> ---
>
> Key: HBASE-22909
> URL: https://issues.apache.org/jira/browse/HBASE-22909
> Project: HBase
>  Issue Type: Task
>  Components: build, community
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.3.0, 2.1.5, 1.3.5
>Reporter: Artem Ervits
>Priority: Minor
>
> hbase-vote script is in different state across all of the current releases. 
> Now that https://issues.apache.org/jira/browse/HBASE-22464 is merged, this 
> Jira is to converge all releases on one version of hbase-vote.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (HBASE-22909) align hbase-vote script across all current releases

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell reassigned HBASE-22909:
--

Assignee: Andrew Purtell

> align hbase-vote script across all current releases
> ---
>
> Key: HBASE-22909
> URL: https://issues.apache.org/jira/browse/HBASE-22909
> Project: HBase
>  Issue Type: Task
>  Components: build, community
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.3.0, 2.1.5, 1.3.5
>Reporter: Artem Ervits
>Assignee: Andrew Purtell
>Priority: Minor
>
> hbase-vote script is in different state across all of the current releases. 
> Now that https://issues.apache.org/jira/browse/HBASE-22464 is merged, this 
> Jira is to converge all releases on one version of hbase-vote.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22760) Stop/Resume Snapshot Auto-Cleanup activity with shell command

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917182#comment-16917182
 ] 

Andrew Purtell commented on HBASE-22760:


009 patch lgtm

Is this an unintentional change?
{code}
diff --git a/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java b/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
index dff15db68e..a999e73b02 100644
--- a/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
+++ b/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
@@ -1471,8 +1471,6 @@ public final class HConstants {
   // User defined Default TTL config key
   public static final String DEFAULT_SNAPSHOT_TTL_CONFIG_KEY = "hbase.master.snapshot.ttl";
 
-  public static final String SNAPSHOT_CLEANER_DISABLE = "hbase.master.cleaner.snapshot.disable";
-
   /**
    * Configurations for master executor services.
    */
{code}


> Stop/Resume Snapshot Auto-Cleanup activity with shell command
> -
>
> Key: HBASE-22760
> URL: https://issues.apache.org/jira/browse/HBASE-22760
> Project: HBase
>  Issue Type: Improvement
>  Components: Admin, shell, snapshots
>Affects Versions: 3.0.0, 1.5.0, 2.3.0, 2.2.1, 1.4.11
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0, 2.2.1, 1.4.11
>
> Attachments: HBASE-22760.master.003.patch, 
> HBASE-22760.master.004.patch, HBASE-22760.master.005.patch, 
> HBASE-22760.master.008.patch, HBASE-22760.master.009.patch
>
>
> For any scheduled snapshot backup activity, we would like to disable 
> auto-cleaner for snapshot based on TTL. However, as per HBASE-22648 we have a 
> config to disable snapshot auto-cleaner: 
> hbase.master.cleaner.snapshot.disable, which would take effect only upon 
> HMaster restart just similar to any other hbase-site configs.
> For any running cluster, we should be able to stop/resume auto-cleanup 
> activity for snapshot based on shell command. Something similar to below 
> command should be able to stop/start cleanup chore:
> hbase(main):001:0> snapshot_auto_cleanup_switch false    (disable 
> auto-cleaner)
> hbase(main):001:0> snapshot_auto_cleanup_switch true     (enable auto-cleaner)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22653) compile failure using JDK7 and errorProne

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22653:
---
Fix Version/s: (was: 1.4.11)
   1.4.12

> compile failure using JDK7 and errorProne
> -
>
> Key: HBASE-22653
> URL: https://issues.apache.org/jira/browse/HBASE-22653
> Project: HBase
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.5.0, 1.4.11
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Major
> Fix For: 1.5.0, 1.4.12
>
>
> Nightly JDK7 build fails for branch-1 and branch-1.4 when errorProne profile 
> is used.
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.6.1:compile 
> (default-compile) on project hbase-common: Execution default-compile of goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.6.1:compile failed: An API 
> incompatibility was encountered while executing 
> org.apache.maven.plugins:maven-compiler-plugin:3.6.1:compile: 
> java.lang.UnsupportedClassVersionError: javax/tools/DiagnosticListener : 
> Unsupported major.minor version 52.0{noformat}
> https://builds.apache.org/job/HBase%20Nightly/job/branch-1/929/ 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22380) break circle replication when doing bulkload

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917170#comment-16917170
 ] 

Andrew Purtell commented on HBASE-22380:


Can we commit this and then the followup for MOB? 1.4.11 has this pending as 
part of that release and I'd like to get it out to fix the recovered edits 
mis-placement problem in 1.4.10. 

> break circle replication when doing bulkload
> 
>
> Key: HBASE-22380
> URL: https://issues.apache.org/jira/browse/HBASE-22380
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0, 1.5.0, 2.2.0, 1.4.10, 2.0.5, 2.3.0, 2.1.5, 1.3.5
>Reporter: chenxu
>Assignee: Wellington Chevreuil
>Priority: Critical
>  Labels: bulkload
> Fix For: 3.0.0, 1.5.0, 2.3.0, 1.4.11, 2.1.7, 2.2.2
>
>
> when enabled master-master bulkload replication, HFiles will be replicated 
> circularly between two clusters



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22649) FileNotFoundException shown in UI when tried to access HFILE URL of a column family name have special char (e.g #)

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917167#comment-16917167
 ] 

Andrew Purtell commented on HBASE-22649:


Agree it would be better to use a standard constant instead of supplying the 
"UTF-8" string. 

> FileNotFoundException shown in UI when tried to access HFILE URL of a column 
> family name have special char (e.g #)
> --
>
> Key: HBASE-22649
> URL: https://issues.apache.org/jira/browse/HBASE-22649
> Project: HBase
>  Issue Type: Bug
>  Components: UI
>Affects Versions: 3.0.0, 2.1.5, 1.3.5
>Reporter: Ashok shetty
>Assignee: Y. SREENIVASULU REDDY
>Priority: Major
> Fix For: 3.0.0, 1.3.6, 2.1.7
>
> Attachments: HBASE-22649.branch-1.002.patch, 
> HBASE-22649.branch-1.patch, HBASE-22649.branch-2.patch, HBASE-22649.patch
>
>
> 【Test step】:
> 1. create 'specialchar' ,'#'
> 2.put 'specialchar','r1','#:cq','1000'
> 3.flush 'specialchar'
> 4.put 'specialchar','r2','#:cq','1000'
> 5.flush 'specialchar'
>  
> Once hfile is created, click the hfile link in UI.
> The following error is throwing.
> {noformat}
> java.io.FileNotFoundException: Path is not a file: 
> /hbase/data/default/specialchar/df9d19830c562c4eeb3f8b396211d52d
>  at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:90)
>  at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:153)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1942)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:739)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:432)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2684)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22872) Don't create normalization plan unnecesarily when split and merge both are disabled

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22872:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Don't create normalization plan unnecesarily when split and merge both are 
> disabled
> ---
>
> Key: HBASE-22872
> URL: https://issues.apache.org/jira/browse/HBASE-22872
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.4.10
>Reporter: Aman Poonia
>Assignee: Aman Poonia
>Priority: Minor
> Fix For: 1.5.0, 2.2.1, 1.3.6, 1.4.11, 2.1.7
>
> Attachments: HBASE-22872.branch-1.4.001.patch, 
> HBASE-22872.branch-1.4.002.patch, HBASE-22872.branch-1.4.003.patch, 
> HBASE-22872.branch-1.4.004.patch, HBASE-22872.branch-1.4.005.patch, 
> HBASE-22872.branch-2.patch, HBASE-22872.master.001.patch, 
> HBASE-22872.master.v01.patch
>
>
> We should not proceed futher in normalization plan creation if split and 
> merge both are disabled on a table.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22706) Backport HBASE-21292 to branch-1

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917160#comment-16917160
 ] 

Andrew Purtell commented on HBASE-22706:


Picked to branch-1.3 also. We've been running this same change internally for a 
while in a 1.3 based code base without issues.

> Backport HBASE-21292 to branch-1
> 
>
> Key: HBASE-22706
> URL: https://issues.apache.org/jira/browse/HBASE-22706
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Pankaj Kumar
>Assignee: Pankaj Kumar
>Priority: Major
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-22706.branch-1.patch, HBASE-22706.branch-1.patch
>
>
> Recently we met the same problem in one of our production env (HBase-1.3.1). 
>  
> I think we missed this,
> https://issues.apache.org/jira/browse/HBASE-21292?focusedCommentId=16656135=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16656135
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HBASE-22706) Backport HBASE-21292 to branch-1

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22706:
---
Fix Version/s: (was: 1.6.0)
   1.3.6
   1.5.0

> Backport HBASE-21292 to branch-1
> 
>
> Key: HBASE-22706
> URL: https://issues.apache.org/jira/browse/HBASE-22706
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Pankaj Kumar
>Assignee: Pankaj Kumar
>Priority: Major
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-22706.branch-1.patch, HBASE-22706.branch-1.patch
>
>
> Recently we met the same problem in one of our production env (HBase-1.3.1). 
>  
> I think we missed this,
> https://issues.apache.org/jira/browse/HBASE-21292?focusedCommentId=16656135=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16656135
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22862) Region Server crash with: Added a key not lexically larger than previous

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917138#comment-16917138
 ] 

Andrew Purtell commented on HBASE-22862:


Yes it's branch-1 only. There is some compatibility code in FixedFileTrailer in 
later versions that still understands the classname but the class itself has 
been removed. 

Does the VisibleForTesting annotation mean the method with the weird and 
possibly incorrect comparison is only used in unit tests? 

> Region Server crash with: Added a key not lexically larger than previous
> 
>
> Key: HBASE-22862
> URL: https://issues.apache.org/jira/browse/HBASE-22862
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 1.4.10
> Environment: {code}
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (Zulu 8.31.0.1-linux64) (build 1.8.0_181-b02)
> OpenJDK 64-Bit Server VM (Zulu 8.31.0.1-linux64) (build 25.181-b02, mixed 
> mode)
> {code}
>Reporter: Alex Batyrshin
>Assignee: Zheng Hu
>Priority: Critical
> Attachments: HBASE-22862.UT.v01.patch, HBASE-22862.UT.v02.patch
>
>
> We observe error "Added a key not lexically larger than previous” that cause 
> most of our region-servers to crash in our cluster.
> {code}
> 2019-08-15 18:02:10,554 INFO  [MemStoreFlusher.0] regionserver.HRegion: 
> Flushing 1/1 column families, memstore=56.08 MB
> 2019-08-15 18:02:10,727 WARN  [MemStoreFlusher.0] regionserver.HStore: Failed 
> flushing store file, retrying num=0
> java.io.IOException: Added a key not lexically larger than previous. Current 
> cell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567,
>  lastCell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770
>at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139)
>at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003)
>at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200)
>at 
> org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
>at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
>at java.lang.Thread.run(Thread.java:748)
> 2019-08-15 18:02:21,776 WARN  [MemStoreFlusher.0] regionserver.HStore: Failed 
> flushing store file, retrying num=9
> java.io.IOException: Added a key not lexically larger than previous. Current 
> cell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567,
>  lastCell = 
> \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770
>at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279)
>at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127)
>at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139)
>at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003)
>at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>at 
> 

[jira] [Comment Edited] (HBASE-22839) Make sure the batches within one region are shipped to the sink clusters in order (branch-1)

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917084#comment-16917084
 ] 

Andrew Purtell edited comment on HBASE-22839 at 8/27/19 8:28 PM:
-

It doesn't matter if the development is done in branch-1 or master first, you 
can do whatever is more convenient. Eventually we will need patches for master, 
branch-2, and branch-1 before they can all be committed, though. 

However we don't need a complete solution of different selectable replication 
models normalized over three branches right now. What we need, as claimed by 
this issue, is an option that ensures batches within one region are shipped to 
the sink clusters in order, to be committed to master, branch-2, and branch-1. 

If something isn't implemented soon, there is no way it will make 1.5. 


was (Author: apurtell):
It doesn't matter if the development is done in branch-1 or master first, you 
can do whatever is more convenient. Eventually we will need patches for master, 
branch-2, and branch-1 before they can all be committed, though. 

However we don't need a complete solution of different selectable replication 
models normalized over three branches right now. What we need, as claimed by 
this issue, is an option that ensures batches within one region are shipped to 
the sink clusters in orde, to be committed to master, branch-2, and branch-1. 

If something isn't implemented soon, there is no way it will make 1.5. 

> Make sure the batches within one region are shipped to the sink clusters in 
> order (branch-1)
> 
>
> Key: HBASE-22839
> URL: https://issues.apache.org/jira/browse/HBASE-22839
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 1.3.4, 1.3.5
>Reporter: Bin Shi
>Assignee: Bin Shi
>Priority: Major
> Fix For: 1.5.0
>
>
> Problem Statement:
> In the cross-cluster replication validation, we found some cells in source 
> and sink cluster can have the same row key, the same timestamp but different 
> values. This happens when mutations with the same row key are submitted in 
> batch without specifying the timestamp, and the same timestamp in the unit of 
> millisecond is assigned at the time when they are committed to the WAL. 
> When this happens, if the major compaction hasn’t happened yet and you scan 
> the table, you can find some cells have the same row key, the same timestamps 
> but different values, like the first three rows in the following table.
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 1|
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 2|
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 3|
> |Row Key 2|CF0::Column 1|Timestamp 2|Value 4|
> |Row Key 3|CF0::Column 1|Timestamp 4|Value 5|
> The ordering of the first three rows is indeterminate in the presence of the 
> cross-replication, so after compaction, in the master cluster you will see 
> “Row Key 1, CF0::Column1, Timestamp1” having the value 3, but in the slave 
> cluster, you might see the cell having one of the three possible values 1, 2, 
> 3, which results in data inconsistency issue between the master and slave 
> clusters.
> Root Cause Analysis:
> In HBaseInterClusterReplicationEndpoint.createBatches() of branch-1.3, the 
> WAL entries from the same region can be split into different batches 
> according to the replication RPC limit, and these batches are shipped by 
> ReplicationSource concurrently, so the batches for the same region can 
> arrive at the sink in the slave clusters and be applied to the region in an 
> indeterminate order.
> Solution:
> In HBase 3.0.0 and 2.1.0, [~Apache9]&[~openinx]&[~fenghh] provided Serial 
> Replication (HBASE-20046), which guarantees that the order of pushing logs to 
> slave clusters is the same as the order of requests from the client in the 
> master cluster. 
> It contains mainly two changes:
>  # Recording the replication "barriers" in ZooKeeper to synchronize the 
> replication across old/failed RS and new RS to provide strict ordering 
> semantics even in the presence of region-move or RS failure.
>  # Make sure the batches within one region are shipped to the slave clusters 
> in order.
> The second part of the change is exactly what we need and is the minimal 
> change to fix the issue in this JIRA.
> To fix the issue in this JIRA, we have two options:
>  # Cherry-pick HBASE-20046 to branch 1.3. Pros: it also fixes the data 
> inconsistency issue when there is a region move or RS failure and helps to 
> reduce the noise in our cross-cluster replication/backup validation, which is 
> our ultimate goal. Cons: the change is big, and I'm not sure for now whether 
> it is self-contained or has other dependencies that need to be ported 
> to branch 1.3 too; and we would need a longer time to validate and stabilize.
>  # Port the minimal change, or make a change equivalent to the second part 
> of HBASE-20046, to make sure the batches within one region are shipped to the 
> slave clusters in order.
> With limited knowledge of the HBase release schedule and process, I prefer 
> option 2 because of the cons of option 1, but I'm open to option 1 and other 
> options. Thoughts? 
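
For illustration, here is a minimal client-side sketch of the write pattern described in the problem statement above: several Puts to the same row and column, batched without explicit timestamps, so the server assigns the commit-time millisecond timestamp. It uses the 1.x client API; the table, family, and qualifier names are made up for the example and do not come from this JIRA.

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SameTimestampBatchExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table "t1" with family "cf0"; only for illustration.
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      List<Put> batch = new ArrayList<Put>();
      // Three mutations to the same row and column, no timestamp specified.
      for (int i = 1; i <= 3; i++) {
        Put p = new Put(Bytes.toBytes("row-key-1"));
        p.addColumn(Bytes.toBytes("cf0"), Bytes.toBytes("c1"), Bytes.toBytes("value-" + i));
        batch.add(p);
      }
      // Submitted as one batch; the server assigns the timestamps, so all three
      // cells can end up with the same millisecond timestamp but different values.
      table.put(batch);
    }
  }
}
{code}

Supplying an explicit timestamp per Put would avoid the same-timestamp ambiguity for this particular pattern, but the replication ordering problem described above exists independently of it.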

[jira] [Updated] (HBASE-22839) Make sure the batches within one region are shipped to the sink clusters in order (branch-1)

2019-08-27 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22839:
---
Fix Version/s: 2.3.0
   3.0.0

> Make sure the batches within one region are shipped to the sink clusters in 
> order (branch-1)
> 
>
> Key: HBASE-22839
> URL: https://issues.apache.org/jira/browse/HBASE-22839
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 1.3.4, 1.3.5
>Reporter: Bin Shi
>Assignee: Bin Shi
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.3.0
>
>
> Problem Statement:
> In cross-cluster replication validation, we found that some cells in the 
> source and sink clusters can have the same row key and the same timestamp but 
> different values. This happens when mutations with the same row key are 
> submitted in a batch without specifying the timestamp, and the same 
> millisecond timestamp is assigned to them when they are committed to the WAL. 
> When this happens, if a major compaction hasn’t happened yet and you scan the 
> table, you can find cells that have the same row key and the same timestamp 
> but different values, like the first three rows in the following table.
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 1|
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 2|
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 3|
> |Row Key 2|CF0::Column 1|Timestamp 2|Value 4|
> |Row Key 3|CF0::Column 1|Timestamp 4|Value 5|
> The ordering of the first three rows is indeterminate in the presence of 
> cross-cluster replication, so after compaction the master cluster will show 
> “Row Key 1, CF0::Column 1, Timestamp 1” with value 3, but the slave cluster 
> might show the cell with any of the three possible values 1, 2, or 3, which 
> results in a data inconsistency issue between the master and slave clusters.
> Root Cause Analysis:
> In HBaseInterClusterReplicationEndpoint.createBatches() of branch-1.3, the 
> WAL entries from the same region can be split into different batches 
> according to the replication RPC limit, and these batches are shipped by 
> ReplicationSource concurrently, so the batches for the same region can 
> arrive at the sink in the slave clusters and be applied to the region in an 
> indeterminate order.
> Solution:
> In HBase 3.0.0 and 2.1.0, [~Apache9]&[~openinx]&[~fenghh] provided Serial 
> Replication (HBASE-20046), which guarantees that the order of pushing logs to 
> slave clusters is the same as the order of requests from the client in the 
> master cluster. 
> It contains mainly two changes:
>  # Recording the replication "barriers" in ZooKeeper to synchronize the 
> replication across old/failed RS and new RS to provide strict ordering 
> semantics even in the presence of region-move or RS failure.
>  # Make sure the batches within one region are shipped to the slave clusters 
> in order.
> The second part of the change is exactly what we need and is the minimal 
> change to fix the issue in this JIRA.
> To fix the issue in this JIRA, we have two options:
>  # Cherry-pick HBASE-20046 to branch 1.3. Pros: it also fixes the data 
> inconsistency issue when there is a region move or RS failure and helps to 
> reduce the noise in our cross-cluster replication/backup validation, which is 
> our ultimate goal. Cons: the change is big, and I'm not sure for now whether 
> it is self-contained or has other dependencies that need to be ported to 
> branch 1.3 too; and we would need a longer time to validate and stabilize.
>  # Port the minimal change, or make a change equivalent to the second part 
> of HBASE-20046, to make sure the batches within one region are shipped to the 
> slave clusters in order.
> With limited knowledge of the HBase release schedule and process, I prefer 
> option 2 because of the cons of option 1, but I'm open to option 1 and other 
> options. Thoughts? 
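
Before a major compaction, one way to observe the duplicate cells described above is a raw scan that returns every cell version. The sketch below assumes the 1.x client API and hypothetical table and family names; it is only an illustration, not code from the validation tooling mentioned in the description.

{code:java}
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RawScanExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical table "t1" and family "cf0"; only for illustration.
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      Scan scan = new Scan();
      scan.setRaw(true);       // return all cells, even ones normally hidden
      scan.setMaxVersions();   // do not collapse to the latest version only
      scan.addFamily(Bytes.toBytes("cf0"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          for (Cell c : r.rawCells()) {
            // Duplicate cells show up here as multiple entries for the same
            // row and column with equal timestamps but different values.
            System.out.println(Bytes.toString(CellUtil.cloneRow(c)) + " @ "
                + c.getTimestamp() + " = " + Bytes.toString(CellUtil.cloneValue(c)));
          }
        }
      }
    }
  }
}
{code}

Note that a raw scan cannot name individual columns, which is why the sketch restricts the scan to a column family only.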



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-22839) Make sure the batches within one region are shipped to the sink clusters in order (branch-1)

2019-08-27 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917084#comment-16917084
 ] 

Andrew Purtell commented on HBASE-22839:


It doesn't matter if the development is done in branch-1 or master first, you 
can do whatever is more convenient. Eventually we will need patches for master, 
branch-2, and branch-1 before they can all be committed, though. 

However we don't need a complete solution of different selectable replication 
models normalized over three branches right now. What we need, as claimed by 
this issue, is an option that ensures batches within one region are shipped to 
the sink clusters in order, to be committed to master, branch-2, and branch-1. 

If something isn't implemented soon, there is no way it will make 1.5. 

> Make sure the batches within one region are shipped to the sink clusters in 
> order (branch-1)
> 
>
> Key: HBASE-22839
> URL: https://issues.apache.org/jira/browse/HBASE-22839
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Affects Versions: 1.3.4, 1.3.5
>Reporter: Bin Shi
>Assignee: Bin Shi
>Priority: Major
> Fix For: 1.5.0
>
>
> Problem Statement:
> In cross-cluster replication validation, we found that some cells in the 
> source and sink clusters can have the same row key and the same timestamp but 
> different values. This happens when mutations with the same row key are 
> submitted in a batch without specifying the timestamp, and the same 
> millisecond timestamp is assigned to them when they are committed to the WAL. 
> When this happens, if a major compaction hasn’t happened yet and you scan the 
> table, you can find cells that have the same row key and the same timestamp 
> but different values, like the first three rows in the following table.
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 1|
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 2|
> |Row Key 1|CF0::Column 1|Timestamp 1|Value 3|
> |Row Key 2|CF0::Column 1|Timestamp 2|Value 4|
> |Row Key 3|CF0::Column 1|Timestamp 4|Value 5|
> The ordering of the first three rows is indeterminate in the presence of 
> cross-cluster replication, so after compaction the master cluster will show 
> “Row Key 1, CF0::Column 1, Timestamp 1” with value 3, but the slave cluster 
> might show the cell with any of the three possible values 1, 2, or 3, which 
> results in a data inconsistency issue between the master and slave clusters.
> Root Cause Analysis:
> In HBaseInterClusterReplicationEndpoint.createBatches() of branch-1.3, the 
> WAL entries from the same region can be split into different batches 
> according to the replication RPC limit, and these batches are shipped by 
> ReplicationSource concurrently, so the batches for the same region can 
> arrive at the sink in the slave clusters and be applied to the region in an 
> indeterminate order.
> Solution:
> In HBase 3.0.0 and 2.1.0, [~Apache9]&[~openinx]&[~fenghh] provided Serial 
> Replication (HBASE-20046), which guarantees that the order of pushing logs to 
> slave clusters is the same as the order of requests from the client in the 
> master cluster. 
> It contains mainly two changes:
>  # Recording the replication "barriers" in ZooKeeper to synchronize the 
> replication across old/failed RS and new RS to provide strict ordering 
> semantics even in the presence of region-move or RS failure.
>  # Make sure the batches within one region are shipped to the slave clusters 
> in order.
> The second part of the change is exactly what we need and is the minimal 
> change to fix the issue in this JIRA.
> To fix the issue in this JIRA, we have two options:
>  # Cherry-pick HBASE-20046 to branch 1.3. Pros: it also fixes the data 
> inconsistency issue when there is a region move or RS failure and helps to 
> reduce the noise in our cross-cluster replication/backup validation, which is 
> our ultimate goal. Cons: the change is big, and I'm not sure for now whether 
> it is self-contained or has other dependencies that need to be ported to 
> branch 1.3 too; and we would need a longer time to validate and stabilize.
>  # Port the minimal change, or make a change equivalent to the second part 
> of HBASE-20046, to make sure the batches within one region are shipped to the 
> slave clusters in order.
> With limited knowledge of the HBase release schedule and process, I prefer 
> option 2 because of the cons of option 1, but I'm open to option 1 and other 
> options. Thoughts? 
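
To make option 2 a bit more concrete, below is a rough, self-contained sketch of the grouping idea. It is not the actual HBaseInterClusterReplicationEndpoint.createBatches() code, and the Entry type is a hypothetical stand-in for a WAL entry. Batches are still split by a size limit, but all batches of one region form an ordered list that must be shipped sequentially, while the lists for different regions may still ship in parallel.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerRegionBatcher {

  /** Hypothetical stand-in for a WAL entry; not an HBase class. */
  public static class Entry {
    final String encodedRegionName;
    final long sizeBytes;
    public Entry(String encodedRegionName, long sizeBytes) {
      this.encodedRegionName = encodedRegionName;
      this.sizeBytes = sizeBytes;
    }
  }

  /**
   * Split entries into size-limited batches while keeping all batches of one
   * region in a single ordered list. The caller ships each region's list
   * sequentially (preserving per-region order) and may ship different regions'
   * lists in parallel.
   */
  public static Map<String, List<List<Entry>>> createBatches(List<Entry> entries,
      long maxBatchBytes) {
    Map<String, List<List<Entry>>> perRegion = new LinkedHashMap<String, List<List<Entry>>>();
    Map<String, Long> currentBytes = new HashMap<String, Long>();
    for (Entry e : entries) {
      List<List<Entry>> batches = perRegion.get(e.encodedRegionName);
      if (batches == null) {
        batches = new ArrayList<List<Entry>>();
        batches.add(new ArrayList<Entry>());
        perRegion.put(e.encodedRegionName, batches);
        currentBytes.put(e.encodedRegionName, 0L);
      }
      List<Entry> last = batches.get(batches.size() - 1);
      long size = currentBytes.get(e.encodedRegionName);
      // Start a new batch for this region once the size limit is reached, but
      // never move an earlier entry of the region into a later batch.
      if (!last.isEmpty() && size + e.sizeBytes > maxBatchBytes) {
        last = new ArrayList<Entry>();
        batches.add(last);
        size = 0L;
      }
      last.add(e);
      currentBytes.put(e.encodedRegionName, size + e.sizeBytes);
    }
    return perRegion;
  }
}
{code}

A complete fix presumably also needs to preserve this guarantee across retries and sink failures, which is part of why the description points at the second half of HBASE-20046 rather than only reshuffling createBatches().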



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (HBASE-21521) Expose master startup status via JMX and web UI

2019-08-26 Thread Andrew Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916199#comment-16916199
 ] 

Andrew Purtell edited comment on HBASE-21521 at 8/26/19 11:00 PM:
--

What stack said.

However, the inspiration for this idea comes in part from what HDFS does for 
their namenode: HDFS-4249. Judging by the screenshot on that JIRA, a lot of 
front-end knowledge was not required. It looks like a table and some basic 
formatting, rendered once. To update, you refetch in the browser. 

Namenode startup can be orders of magnitude longer than HBase master startup 
for all but the really abnormal cases (like hundreds of queued 
ServerCrashProcedures), so the pressure to do a detailed phase-by-phase status 
report update is less, but it would still be nice. 


was (Author: apurtell):
What stack said.

However, the inspiration for this idea comes in part from what HDFS does for 
their namenode: HDFS-4249

Namenode startup can be orders of magnitude longer than HBase master startup 
for all but the really abnormal cases (like hundreds of queued 
ServerCrashProcedures) so the pressure to do a detailed phase by phase status 
report update is less, but would still be nice. 

> Expose master startup status via JMX and web UI
> ---
>
> Key: HBASE-21521
> URL: https://issues.apache.org/jira/browse/HBASE-21521
> Project: HBase
>  Issue Type: Improvement
>  Components: master, UI
>Reporter: Andrew Purtell
>Priority: Major
>
> Add an internal API to the master for tracking startup progress. Expose this 
> information via JMX.
> Modify the master to bring the web UI up sooner. This will require tweaks to 
> various views to prevent attempts to retrieve state before the master is fully 
> up (or else expect NPEs). Currently, before the master has fully initialized, 
> an attempt to use the web UI will return a 500 error code and display an 
> error page.
> Finally, update the web UI to display startup progress, like HDFS-4249. 
> Filing this for branch-1. Need to check what, if anything, is available or 
> improved in branch-2 and master.
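
As a sketch of what the internal startup-progress API plus JMX exposure could look like, here is a minimal, hypothetical example. The class, MBean interface, phase names, and JMX object name are all made up for illustration and are not the eventual HBase API.

{code:java}
// Two hypothetical files, shown together for brevity.

// --- MasterStartupProgressMBean.java ---
public interface MasterStartupProgressMBean {
  /** A compact phase=status summary, e.g. "WAL_SPLITTING=COMPLETE;REGION_ASSIGNMENT=RUNNING". */
  String getStartupStatus();
}

// --- MasterStartupProgress.java ---
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class MasterStartupProgress implements MasterStartupProgressMBean {
  // Insertion-ordered so the summary lists phases in the order they started.
  private final Map<String, String> phases = new LinkedHashMap<String, String>();

  public synchronized void phaseStarted(String phase) { phases.put(phase, "RUNNING"); }
  public synchronized void phaseComplete(String phase) { phases.put(phase, "COMPLETE"); }

  @Override
  public synchronized String getStartupStatus() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : phases.entrySet()) {
      sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
    }
    return sb.toString();
  }

  /** Register with the platform MBean server; the object name here is made up. */
  public void register() throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    server.registerMBean(this, new ObjectName("Hadoop:service=HBase,name=MasterStartupProgress"));
  }
}
{code}

The same phase map could back a simple table on the startup web UI that is re-rendered on each refresh, along the lines of the HDFS-4249 page mentioned in the comments above.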



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

