[jira] [Updated] (HBASE-22978) Online slow response log

2023-04-13 Thread Liangjun He (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangjun He updated HBASE-22978:

Attachment: (was: Alluxio 开源AI和大数据存储编排平台.pdf)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Affects Versions: 3.0.0-alpha-1, 2.3.0
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
> Attachments: NamedQueue_Framework_Design_HBASE-24528_HBASE-22978_HBASE-24718.pdf,
> Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 2019-10-19 at 2.32.54 AM.png,
> Screen Shot 2019-10-19 at 2.34.11 AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, as in the above 
> example. We do this because the human-readable representation is verbose, the 
> rate of too-slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate-to-low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade-off.
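
To make the ring buffer behaviour above concrete, here is a minimal sketch in Python rather than HBase's Java. The `SlowResponse` record and `SlowResponseRing` class are hypothetical names invented for illustration; the fields simply mirror the keys of the responseTooSlow log line, and nothing here is taken from the actual patch. It shows the two properties the paragraph relies on: memory is bounded by the record count, and a burst of slow requests silently pushes out the oldest entries.

```python
from collections import deque
from dataclasses import dataclass
from threading import Lock


@dataclass(frozen=True)
class SlowResponse:
    """Hypothetical record; fields mirror the responseTooSlow log line."""
    method: str
    client: str
    processing_time_ms: int
    queue_time_ms: int
    response_size: int
    param: str  # complete, untruncated request parameters


class SlowResponseRing:
    """Fixed-capacity buffer that overwrites its oldest entry when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen discards the oldest item on append when full.
        self._entries = deque(maxlen=capacity)
        self._lock = Lock()
        self.overwritten = 0  # entries pushed out by newer ones

    def record(self, entry: SlowResponse) -> None:
        with self._lock:
            if len(self._entries) == self._entries.maxlen:
                self.overwritten += 1
            self._entries.append(entry)

    def snapshot(self) -> list:
        """Return the retained entries, oldest first, without draining."""
        with self._lock:
            return list(self._entries)
```

With a capacity of 2, recording three entries leaves only the last two in `snapshot()` and sets `overwritten` to 1, which is the accepted loss described above.
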
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
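
As a rough illustration of the reduced write-request record, the sketch below collapses a batch of mutations into a unique row count, a min/max row range, and a total cell count. The `(row_key, cell_count)` input shape and the `MutationSummary` name are assumptions made for this example only; the real request types carry far more structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationSummary:
    """Hypothetical compact stand-in for a full write request."""
    unique_rows: int
    min_row: bytes
    max_row: bytes
    total_cells: int


def summarize_mutations(mutations) -> MutationSummary:
    """Summarize an iterable of (row_key: bytes, cell_count: int) pairs.

    Python compares bytes lexicographically by unsigned byte value, so
    min() and max() give the bounds of the touched row range.
    """
    rows = set()
    total_cells = 0
    for row_key, cell_count in mutations:
        rows.add(row_key)
        total_cells += cell_count
    if not rows:
        raise ValueError("no mutations to summarize")
    return MutationSummary(len(rows), min(rows), max(rows), total_cells)
```

The summary has a fixed size regardless of how many key-values the mutation carried, which is what makes it comfortable to retain.
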
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), such as Snappy or Zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
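
The compress-on-insert, decompress-on-read idea can be sketched as follows. Note the substitution: this uses zlib at its fastest level from the Python standard library as a stand-in, not Snappy or Zstandard, and the JSON encoding of an entry is an assumption for the example rather than the proposed in-memory format.

```python
import json
import zlib


def compress_entry(entry: dict) -> bytes:
    """Serialize an entry and compress it at the fastest zlib level (1)."""
    raw = json.dumps(entry, sort_keys=True).encode("utf-8")
    return zlib.compress(raw, 1)


def decompress_entry(blob: bytes) -> dict:
    """Inverse of compress_entry, run only when the entry is retrieved."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The cost of decompression is paid only by the debugging tool that asks for the entries, not on the RPC path that records them.
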
> This proposal is for retention of request information only, that is, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response-too-slow logging is best effort. If we can detect this, make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
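
A best-effort write-behind path along the lines described above might look like the following sketch. `WriteBehindLog`, `offer`, and `drain` are hypothetical names; the sink is any object with a `write()` method standing in for an HDFS output stream, and the `# dropped N entries` marker line is an invented format for the "make a note of it" step.

```python
import queue
import threading


class WriteBehindLog:
    """Bounded hand-off between the RPC path and a background writer."""

    def __init__(self, max_pending: int) -> None:
        if max_pending <= 0:
            raise ValueError("max_pending must be positive")
        self._pending = queue.Queue(maxsize=max_pending)
        self._lock = threading.Lock()
        self._dropped = 0

    def offer(self, entry: str) -> bool:
        """Queue an entry without blocking; count it as dropped if full."""
        try:
            self._pending.put_nowait(entry)
            return True
        except queue.Full:
            with self._lock:
                self._dropped += 1
            return False

    def drain(self, sink) -> int:
        """Write all pending entries to sink and return how many were written.

        If entries were dropped since the last drain, a marker line is
        written first so a reader of the file knows the log has gaps.
        """
        with self._lock:
            dropped, self._dropped = self._dropped, 0
        if dropped:
            sink.write("# dropped %d entries\n" % dropped)
        written = 0
        while True:
            try:
                entry = self._pending.get_nowait()
            except queue.Empty:
                return written
            sink.write(entry + "\n")
            written += 1
```

In use, the request handler would call `offer()` and a separate writer thread would call `drain()` in a loop, so a stalled writer can only cost log entries, never request latency.
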
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too-slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as an additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded

[jira] [Updated] (HBASE-22978) Online slow response log

2023-04-13 Thread Liangjun He (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangjun He updated HBASE-22978:

Attachment: (was: Flink Table Store 流计算存储.pptx)


[jira] [Updated] (HBASE-22978) Online slow response log

2023-04-06 Thread Liangjun He (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangjun He updated HBASE-22978:

Attachment: Alluxio 开源AI和大数据存储编排平台.pdf
Flink Table Store 流计算存储.pptx

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Affects Versions: 3.0.0-alpha-1, 2.3.0
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
> Attachments: Alluxio 开源AI和大数据存储编排平台.pdf, Flink Table Store 流计算存储.pptx,
> NamedQueue_Framework_Design_HBASE-24528_HBASE-22978_HBASE-24718.pdf,
> Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 2019-10-19 at 2.32.54 AM.png,
> Screen Shot 2019-10-19 at 2.34.11 AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png

[jira] [Updated] (HBASE-22978) Online slow response log

2021-10-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
Attachment: NamedQueue_Framework_Design_HBASE-24528_HBASE-22978_HBASE-24718.pdf


[jira] [Updated] (HBASE-22978) Online slow response log

2021-10-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
Attachment: (was: NamedQueue_Framework_Design.pdf)


[jira] [Updated] (HBASE-22978) Online slow response log

2021-10-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
Attachment: NamedQueue_Framework_Design.pdf

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Affects Versions: 3.0.0-alpha-1, 2.3.0
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
> Attachments: NamedQueue_Framework_Design.pdf,
> Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 2019-10-19 at 2.32.54 AM.png,
> Screen Shot 2019-10-19 at 2.34.11 AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png

[jira] [Updated] (HBASE-22978) Online slow response log

2021-10-06 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
Priority: Major  (was: Minor)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Affects Versions: 3.0.0-alpha-1, 2.3.0
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 2019-10-19 at 2.32.54 AM.png,
> Screen Shot 2019-10-19 at 2.34.11 AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pert

[jira] [Updated] (HBASE-22978) Online slow response log

2020-03-05 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
-
Release Note: 
The get_slowlog_responses and clear_slowlog_responses shell commands retrieve 
and clear slow RPC logs from the ring buffer maintained by each RegionServer.

New Admin APIs:
1.   List getSlowLogResponses(final Set serverNames,
  final SlowLogQueryFilter slowLogQueryFilter) throws IOException;

2.   List clearSlowLogResponses(final Set serverNames)
  throws IOException;

Configs:

1. hbase.regionserver.slowlog.ringbuffer.size:
Size of the ring buffer maintained by each RegionServer to store online 
slowlog responses. This is an in-memory ring buffer of requests that were 
judged to be too slow, kept in addition to the responseTooSlow logging. The 
in-memory representation is complete rather than truncated.

Default
256

2. hbase.regionserver.slowlog.buffer.enabled:
Indicates whether RegionServers run a ring buffer that stores online slow logs 
in FIFO manner with a limited number of entries. The size of the ring buffer is 
set by the config hbase.regionserver.slowlog.ringbuffer.size. The default value 
is false; set it to true to get the latest slowlog responses with complete 
data.

Default
false
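
As an illustration of the two configs above, a minimal sketch of the 
corresponding properties, assuming they are placed inside the <configuration> 
element of hbase-site.xml on each RegionServer (the file placement is an 
assumption, not stated in this release note). The size shown is simply the 
documented default.

```xml
<!-- Sketch only: enables the online slow log ring buffer (default is false). -->
<property>
  <name>hbase.regionserver.slowlog.buffer.enabled</name>
  <value>true</value>
</property>
<!-- Sketch only: number of entries retained per RegionServer (documented default). -->
<property>
  <name>hbase.regionserver.slowlog.ringbuffer.size</name>
  <value>256</value>
</property>
```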


For more details, see the "Get Slow Response Log from shell" section of the 
HBase book.
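
The FIFO behaviour with a limited number of entries described in this release 
note can be sketched as follows. This is a hypothetical, self-contained 
illustration and not the HBase implementation: the class name BoundedSlowLog 
and its methods are made up for the example, and records are plain strings 
rather than the real slow log payloads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a fixed-size FIFO buffer; not the HBase code. */
final class BoundedSlowLog<T> {
  private final int capacity;
  private final ArrayDeque<T> entries;

  BoundedSlowLog(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
    this.entries = new ArrayDeque<>(capacity);
  }

  /** Adds a record, evicting the oldest one when the buffer is full. */
  synchronized void add(T record) {
    if (entries.size() == capacity) {
      entries.pollFirst();
    }
    entries.addLast(record);
  }

  /** Returns a copy of the retained records, oldest first. */
  synchronized List<T> snapshot() {
    return new ArrayList<>(entries);
  }

  /** Drops all retained records. */
  synchronized void clear() {
    entries.clear();
  }

  public static void main(String[] args) {
    // With capacity 3, adding five records keeps only the last three.
    BoundedSlowLog<String> log = new BoundedSlowLog<>(3);
    for (int i = 1; i <= 5; i++) {
      log.add("slow-rpc-" + i);
    }
    System.out.println(log.snapshot()); // prints [slow-rpc-3, slow-rpc-4, slow-rpc-5]
  }
}
```

The eviction in add() is the trade-off the issue description accepts: under a 
high rate of slow requests, older entries are overwritten before they can be 
read, but memory use stays bounded by the configured size.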

  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2020-03-05 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
-
Fix Version/s: (was: 1.7.0)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2020-02-13 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-22978:

Fix Version/s: (was: 1.6.0)
   1.7.0

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.7.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2020-01-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
-
Fix Version/s: 2.3.0

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2020-01-23 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-22978:
--
Fix Version/s: (was: 2.3.0)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-30 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
-
Fix Version/s: (was: 1.5.1)
   1.6.0

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Affects Versions: 3.0.0, 2.3.0, 1.5.1
>Reporter: Andrew Kyle Purtell
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-28 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-22978:
-
Fix Version/s: 1.5.1
   2.3.0
   3.0.0

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Affects Versions: 3.0.0, 2.3.0, 1.5.1
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Minor
> Fix For: 3.0.0, 2.3.0, 1.5.1
>
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, as in the example 
> above. We do this because the human-readable representation is verbose, the 
> rate of too-slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade-off.
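
A minimal sketch of the ring buffer behaviour described above, not the implementation attached to this issue. The class name `SlowRequestRingBuffer` and its methods are hypothetical; the point is only that capacity is fixed and that a new record overwrites the oldest one once the buffer is full.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical fixed-capacity ring buffer for slow-request records.
 * Once full, each new record overwrites the oldest retained one.
 */
final class SlowRequestRingBuffer<T> {
    private final Object[] slots;
    private long written; // total number of records ever added

    SlowRequestRingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.slots = new Object[capacity];
    }

    /** Adds a record, overwriting the oldest entry when the buffer is full. */
    synchronized void add(T record) {
        slots[(int) (written % slots.length)] = record;
        written++;
    }

    /** Returns a copy of the retained records, oldest first. */
    synchronized List<T> snapshot() {
        int size = (int) Math.min(written, slots.length);
        List<T> out = new ArrayList<>(size);
        for (long i = written - size; i < written; i++) {
            @SuppressWarnings("unchecked")
            T record = (T) slots[(int) (i % slots.length)];
            out.add(record);
        }
        return out;
    }

    /** Number of records lost to overwriting so far. */
    synchronized long overwritten() {
        return Math.max(0L, written - slots.length);
    }
}
```

With a capacity of 3, adding five records leaves only the last three retrievable, which is the trade-off the description accepts.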
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
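
As a rough illustration of the min/max-plus-count idea for write requests, assuming row keys compare as unsigned bytes in lexicographic order: `MutationSummary` is a hypothetical name, and `Arrays.compareUnsigned` requires Java 9 or later.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical compact summary of a write request: key range and count only. */
final class MutationSummary {
    final byte[] minRow;
    final byte[] maxRow;
    final int rowCount;

    private MutationSummary(byte[] minRow, byte[] maxRow, int rowCount) {
        this.minRow = minRow;
        this.maxRow = maxRow;
        this.rowCount = rowCount;
    }

    /** Summarizes the row keys of a request without retaining any cell values. */
    static MutationSummary of(List<byte[]> rowKeys) {
        byte[] min = null;
        byte[] max = null;
        for (byte[] row : rowKeys) {
            // Unsigned lexicographic comparison, so 0x80 sorts after 0x05.
            if (min == null || Arrays.compareUnsigned(row, min) < 0) {
                min = row;
            }
            if (max == null || Arrays.compareUnsigned(row, max) > 0) {
                max = row;
            }
        }
        return new MutationSummary(min, max, rowKeys.size());
    }
}
```

The summary's size depends on the length of two row keys rather than on the number or size of key-values in the mutation.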
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
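
A sketch of compress-on-insert and decompress-on-read. The description suggests snappy or zstandard; this example swaps in the JDK's built-in DEFLATE at its fastest level purely so that it runs without extra dependencies. `EntryCodec` is a hypothetical helper, not an HBase class.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical codec for ring buffer entries (DEFLATE as a stand-in). */
final class EntryCodec {
    private EntryCodec() {}

    /** Compresses a serialized entry before it is stored in the buffer. */
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Decompresses an entry when servicing a retrieval request. */
    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt entry", e);
        } finally {
            inflater.end();
        }
    }
}
```

The cost of decompression is paid only by the debugging tool that reads an entry, while every retained entry benefits from the smaller footprint.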
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine: response-too-slow logging is best effort. If we can detect that an 
> entry was lost, make a note of it in the log file. Provide a tool for parsing, 
> dumping, filtering, and pretty-printing the slow logs written to HDFS. The 
> tool and the shell can share utility classes and methods for this.
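
One way to read the best-effort write-behind requirement, sketched under assumptions: the writer falling behind is modelled here as a bounded hand-off queue filling up, and the sink is any `Appendable` rather than a real HDFS stream. `WriteBehindLog`, `submit`, and `flush` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical best-effort write-behind hand-off for slow log entries. */
final class WriteBehindLog {
    private final BlockingQueue<String> pending;
    private final AtomicLong dropped = new AtomicLong();

    WriteBehindLog(int maxPending) {
        this.pending = new ArrayBlockingQueue<>(maxPending);
    }

    /** Request path: never blocks; drops the entry if the writer is behind. */
    boolean submit(String entry) {
        boolean accepted = pending.offer(entry);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Writer side: flushes queued entries, then notes any detected drops. */
    void flush(Appendable sink) {
        try {
            String entry;
            while ((entry = pending.poll()) != null) {
                sink.append(entry).append('\n');
            }
            long lost = dropped.getAndSet(0);
            if (lost > 0) {
                sink.append("# ").append(Long.toString(lost)).append(" entries dropped\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because `submit` uses a non-blocking `offer`, a stalled writer costs the request path nothing beyond a counter increment, and the loss is recorded in the output on the next successful flush.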
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region 

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-28 Thread Viraj Jasani (Jira)



Viraj Jasani updated HBASE-22978:
-
Affects Version/s: 1.5.1, 2.3.0, 3.0.0

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Affects Versions: 3.0.0, 2.3.0, 1.5.1
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Minor
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-24 Thread Viraj Jasani (Jira)



Viraj Jasani updated HBASE-22978:
-
Attachment: (was: HBASE-22978.master.001.patch)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Minor
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-24 Thread Viraj Jasani (Jira)



Viraj Jasani updated HBASE-22978:
-
Attachment: (was: HBASE-22978.master.000.patch)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Minor
> Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot 
> 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, 
> Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-21 Thread Viraj Jasani (Jira)



Viraj Jasani updated HBASE-22978:
-
Attachment: Screen Shot 2019-10-19 at 2.36.14 AM.png
            Screen Shot 2019-10-19 at 2.34.11 AM.png
            Screen Shot 2019-10-19 at 2.32.54 AM.png
            Screen Shot 2019-10-19 at 2.31.59 AM.png

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Minor
> Attachments: HBASE-22978.master.000.patch, 
> HBASE-22978.master.001.patch, Screen Shot 2019-10-19 at 2.31.59 AM.png, 
> Screen Shot 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 
> AM.png, Screen Shot 2019-10-19 at 2.36.14 AM.png
>
>

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-19 Thread Viraj Jasani (Jira)



Viraj Jasani updated HBASE-22978:
-
Attachment: HBASE-22978.master.001.patch

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Minor
> Attachments: HBASE-22978.master.000.patch, 
> HBASE-22978.master.001.patch
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, as in the example 
> above. We do this because the human-readable representation is verbose, the 
> rate of too-slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade-off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine: response-too-slow logging is best effort. If we can detect that an 
> entry was lost, make a note of it in the log file. Provide a tool for parsing, 
> dumping, filtering, and pretty-printing the slow logs written to HDFS. The 
> tool and the shell can share utility classes and methods for this.
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a stri

[jira] [Updated] (HBASE-22978) Online slow response log

2019-10-19 Thread Viraj Jasani (Jira)



Viraj Jasani updated HBASE-22978:
-
Attachment: HBASE-22978.master.000.patch
Status: Patch Available  (was: In Progress)

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Reporter: Andrew Kyle Purtell
> Assignee: Viraj Jasani
> Priority: Minor
> Attachments: HBASE-22978.master.000.patch
>
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this, make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as an additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, w

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Sean Busbey (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-22978:

Component/s: Operability

> Online slow response log
> 
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
>  Issue Type: New Feature
>  Components: Admin, Operability, regionserver, shell
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more 
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls 
> behind and we are unable to persist an entry before it is overwritten, that 
> is fine. Response too slow logging is best effort. If we can detect this, make 
> a note of it in the log file. Provide a tool for parsing, dumping, filtering, 
> and pretty printing the slow logs written to HDFS. The tool and the shell can 
> share and reuse some utility classes and methods for accomplishing that. 
> —
> New shell commands:
> {{get_slow_responses [  ... ,  ] [ , \{  
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers in the cluster 
> if no list is provided. Optionally provide a map of parameters for filtering 
> as an additional argument. The TABLE filter, which expects a string containing a 
> table name, will include only entries pertaining to that table. The REGION 
> filter, which expects a string containing an encoded region name, will 
> include only entries pertaining to that region. The CLIENT_IP filter, which 
> expects a string containing an IP address, will include only entries 
> pertaining to that client. The USER filter, which expects a string containing 
> a user name, will include only entries pertaining to that user. Filters are 
> additive, fo

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.
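
A minimal sketch of the overwrite behaviour described above, assuming a fixed number of slots and a simple lock. The class name `SlowResponseRing` and its methods are hypothetical and only illustrate the idea; they are not an existing HBase class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical fixed-capacity ring buffer: once full, each add overwrites the oldest entry. */
final class SlowResponseRing<T> {
  private final Object[] slots;
  private long added; // total number of entries ever added since the last clear

  SlowResponseRing(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.slots = new Object[capacity];
  }

  /** Stores an entry, silently replacing the oldest one when the buffer is full. */
  synchronized void add(T entry) {
    slots[(int) (added % slots.length)] = entry;
    added++;
  }

  /** Returns a copy of the retained entries, oldest first. */
  @SuppressWarnings("unchecked")
  synchronized List<T> snapshot() {
    int size = (int) Math.min(added, slots.length);
    List<T> out = new ArrayList<>(size);
    for (long i = added - size; i < added; i++) {
      out.add((T) slots[(int) (i % slots.length)]);
    }
    return out;
  }

  /** Drops all retained entries. */
  synchronized void clear() {
    Arrays.fill(slots, null);
    added = 0;
  }
}
```

With a capacity of 3, adding five entries leaves only the last three, which is the trade-off accepted above: memory stays bounded, but older entries can be lost before they are read.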

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.
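
As an illustration of the min/max-plus-counts option, the sketch below reduces a batch of mutations to its row key range and totals. The `WriteSummary` class and the shape of its inputs are assumptions made for this example; row keys are compared as unsigned bytes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical compact summary of a write request: row key range plus totals. */
final class WriteSummary {
  final byte[] minRow;     // smallest row key seen, or null for an empty batch
  final byte[] maxRow;     // largest row key seen, or null for an empty batch
  final int mutationCount; // number of mutations in the request
  final long cellCount;    // total number of cells across all mutations

  private WriteSummary(byte[] minRow, byte[] maxRow, int mutationCount, long cellCount) {
    this.minRow = minRow;
    this.maxRow = maxRow;
    this.mutationCount = mutationCount;
    this.cellCount = cellCount;
  }

  /** rows.get(i) is the row key of mutation i; cellsPerMutation[i] is its cell count. */
  static WriteSummary summarize(List<byte[]> rows, int[] cellsPerMutation) {
    if (rows.size() != cellsPerMutation.length) {
      throw new IllegalArgumentException("rows and cell counts must have the same length");
    }
    byte[] min = null;
    byte[] max = null;
    long cells = 0;
    for (int i = 0; i < rows.size(); i++) {
      byte[] row = rows.get(i);
      // Lexicographic comparison treating each byte as unsigned.
      if (min == null || Arrays.compareUnsigned(row, min) < 0) {
        min = row;
      }
      if (max == null || Arrays.compareUnsigned(row, max) > 0) {
        max = row;
      }
      cells += cellsPerMutation[i];
    }
    return new WriteSummary(min, max, rows.size(), cells);
  }
}
```

The summary size is independent of how many key-values the mutations carry, which is the point of not retaining them.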

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.
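
A rough sketch of compress-on-insert and decompress-on-read. Snappy and zstandard are not in the Java standard library, so this example swaps in `java.util.zip.Deflater` at its fastest level purely as a stand-in; the `EntryCodec` class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical codec for ring buffer entries, using DEFLATE as a stand-in for snappy/zstd. */
final class EntryCodec {
  private EntryCodec() {}

  /** Compresses a serialized entry before it is stored in the ring buffer. */
  static byte[] compress(byte[] raw) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    try {
      deflater.setInput(raw);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      while (!deflater.finished()) {
        int n = deflater.deflate(buf);
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } finally {
      deflater.end(); // release native resources
    }
  }

  /** Decompresses a stored entry when servicing a retrieval request. */
  static byte[] decompress(byte[] compressed) {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && !inflater.finished()) {
          // No progress is possible: the input is truncated or needs a dictionary.
          throw new IllegalArgumentException("truncated or unsupported compressed entry");
        }
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (DataFormatException e) {
      throw new IllegalArgumentException("corrupt compressed entry", e);
    } finally {
      inflater.end(); // release native resources
    }
  }
}
```

Compression cost is paid once per slow request and decompression only when someone asks for the buffer, so the hot RPC path sees little extra work.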

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this, make a note of 
it in the log file. Provide a tool for parsing, dumping, filtering, and pretty 
printing the slow logs written to HDFS. The tool and the shell can share and 
reuse some utility classes and methods for accomplishing that. 

—

New shell commands:

{{get_slow_responses [  ... ,  ] [ , \{  } 
]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
an additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. Filters are additive: for example, 
if both CLIENT_IP and USER filters are given, entries matching either or both 
conditions will be included. The one exception is that if both TABLE and REGION 
filters are provided, REGION will be preferred and TABLE will be ignored. A 
server name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this, make a note of 
it in the log file. 

—

New shell commands:

{{get_slow_responses [  ... ,  ] [ , \{  } 
]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
an additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. Filters are additive: for example, 
if both CLIENT_IP and USER filters are given, entries matching either or both 
conditions will be included. The one exception is that if both TABLE and REGION 
filters are provided, REGION will be preferred and TABLE will be ignored. A 
server name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List A

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this, make a note of 
it in the log file. 

—

New shell commands:

{{get_slow_responses [  ... ,  ] [ , \{  } 
]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
an additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. If both TABLE and REGION filters 
are provided, REGION will be used. A server name is its host, port, and start 
code, e.g. "host187.example.com,60020,1289493121758".
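
To make the filter map concrete, the sketch below applies the four filters to in-memory entries. The `SlowEntry` and `SlowEntryFilter` classes are hypothetical and not part of any HBase API. REGION takes precedence over TABLE as described above; how the remaining filters combine is an assumption of this example: an entry is kept if it matches any supplied filter, and everything is kept when no filter is given.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical slow response entry; only the filterable attributes are modeled. */
final class SlowEntry {
  final String table;
  final String region; // encoded region name
  final String clientIp;
  final String user;

  SlowEntry(String table, String region, String clientIp, String user) {
    this.table = table;
    this.region = region;
    this.clientIp = clientIp;
    this.user = user;
  }
}

/** Hypothetical filter; a null field means that filter was not supplied. */
final class SlowEntryFilter {
  final String table;
  final String region;
  final String clientIp;
  final String user;

  SlowEntryFilter(String table, String region, String clientIp, String user) {
    this.table = table;
    this.region = region;
    this.clientIp = clientIp;
    this.user = user;
  }

  /** Assumed semantics: match any supplied filter; REGION replaces TABLE when both are set. */
  boolean matches(SlowEntry e) {
    boolean anyFilter = false;
    if (region != null) {
      anyFilter = true;
      if (region.equals(e.region)) return true;
    } else if (table != null) {
      anyFilter = true;
      if (table.equals(e.table)) return true;
    }
    if (clientIp != null) {
      anyFilter = true;
      if (clientIp.equals(e.clientIp)) return true;
    }
    if (user != null) {
      anyFilter = true;
      if (user.equals(e.user)) return true;
    }
    return !anyFilter; // no filters supplied: keep everything
  }

  /** Returns the entries that pass the filter, preserving their order. */
  List<SlowEntry> apply(List<SlowEntry> entries) {
    List<SlowEntry> kept = new ArrayList<>();
    for (SlowEntry e : entries) {
      if (matches(e)) kept.add(e);
    }
    return kept;
  }
}
```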

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

Optionally persist new entries added to the ring buffer into one or more files 
in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and 
we are unable to persist an entry before it is overwritten, that is fine. 
Response too slow logging is best effort. If we can detect this, make a note of 
it in the log file. 

—

New shell commands:

{{get_slow_responses [  ... ,  ] [ , \{  } 
]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
an additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. A server name is its host, port, 
and start code, e.g. "host187.example.com,60020,1289493121758".
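
A server name in this form can be split into its three parts as sketched below. The `ParsedServerName` class is hypothetical and exists only for this example; real client code would more likely rely on HBase's own `ServerName` type.

```java
/** Hypothetical holder for the three comma-separated parts of a server name. */
final class ParsedServerName {
  final String host;
  final int port;
  final long startCode;

  private ParsedServerName(String host, int port, long startCode) {
    this.host = host;
    this.port = port;
    this.startCode = startCode;
  }

  /** Parses "host,port,startcode"; throws IllegalArgumentException on malformed input. */
  static ParsedServerName parse(String serverName) {
    // A limit of -1 keeps trailing empty fields so that "host,60020," is rejected.
    String[] parts = serverName.split(",", -1);
    if (parts.length != 3 || parts[0].isEmpty()) {
      throw new IllegalArgumentException("expected host,port,startcode: " + serverName);
    }
    // NumberFormatException is a subclass of IllegalArgumentException.
    return new ParsedServerName(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
  }
}
```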

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(String tableOrRegion, @Nullable 
List servers);
{code}
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}

  was:
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsy

[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses [  ... ,  ] [ , \{  } 
]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers in the cluster 
if no list is provided. Optionally provide a map of parameters for filtering as 
an additional argument. The TABLE filter, which expects a string containing a 
table name, will include only entries pertaining to that table. The REGION 
filter, which expects a string containing an encoded region name, will include 
only entries pertaining to that region. The CLIENT_IP filter, which expects a 
string containing an IP address, will include only entries pertaining to that 
client. The USER filter, which expects a string containing a user name, will 
include only entries pertaining to that user. A server name is its host, port, 
and start code, e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(String tableOrRegion, @Nullable 
List servers);
{code}
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}


[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. The truncation is unfortunate because it 
eliminates much of the utility of the warnings. For example, the region name, 
the start and end keys, and the filter hierarchy are all important clues for 
debugging performance problems caused by moderate to low selectivity queries or 
queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade-off.
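
The overwrite behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: {{SlowResponseRing}} is a hypothetical name, and coarse {{synchronized}} locking is used only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Fixed-capacity ring buffer; once full, each new entry overwrites the oldest one. */
final class SlowResponseRing<T> {
    private final Object[] slots;
    private long written; // total number of entries ever added

    SlowResponseRing(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.slots = new Object[capacity];
    }

    /** Record one entry, overwriting the oldest retained entry when the buffer is full. */
    synchronized void add(T entry) {
        slots[(int) (written % slots.length)] = entry;
        written++;
    }

    /** Return the retained entries, oldest first. */
    @SuppressWarnings("unchecked")
    synchronized List<T> snapshot() {
        int size = (int) Math.min(written, slots.length);
        List<T> out = new ArrayList<>(size);
        for (long i = written - size; i < written; i++) {
            out.add((T) slots[(int) (i % slots.length)]);
        }
        return out;
    }

    /** Drop all retained entries. */
    synchronized void clear() {
        Arrays.fill(slots, null);
        written = 0;
    }
}
```

Memory use is bounded by the capacity chosen at construction, which is the point of the modest sizing mentioned above.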

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.
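
One way to picture such a reduced record for write requests is sketched below. {{WriteSummary}} and its fields are hypothetical, and representing a request as a list of row keys plus per-mutation cell counts is an assumption made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical compact summary of a write request: row range and counts, no cell data. */
final class WriteSummary {
    final byte[] minRow;   // smallest row key touched, or null for an empty request
    final byte[] maxRow;   // largest row key touched, or null for an empty request
    final int uniqueRows;
    final long totalCells;

    private WriteSummary(byte[] minRow, byte[] maxRow, int uniqueRows, long totalCells) {
        this.minRow = minRow;
        this.maxRow = maxRow;
        this.uniqueRows = uniqueRows;
        this.totalCells = totalCells;
    }

    /** rowKeys.get(i) is the row of the i-th mutation; cellCounts[i] is its cell count. */
    static WriteSummary of(List<byte[]> rowKeys, long[] cellCounts) {
        if (rowKeys.size() != cellCounts.length) {
            throw new IllegalArgumentException("one cell count is required per row key");
        }
        // Order row keys as unsigned bytes, lexicographically.
        TreeSet<byte[]> unique = new TreeSet<byte[]>((a, b) -> Arrays.compareUnsigned(a, b));
        long cells = 0;
        for (int i = 0; i < cellCounts.length; i++) {
            unique.add(rowKeys.get(i));
            cells += cellCounts[i];
        }
        if (unique.isEmpty()) {
            return new WriteSummary(null, null, 0, cells);
        }
        return new WriteSummary(unique.first(), unique.last(), unique.size(), cells);
    }
}
```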

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.
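
The compress-on-write, decompress-on-read idea can be sketched with the JDK's built-in DEFLATE codec. Note that this swaps in {{java.util.zip}} at its fastest setting as a stand-in for snappy or zstandard, which need codec support outside the JDK; {{EntryCodec}} is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class EntryCodec {
    /** Compress a serialized entry before it is stored in the ring buffer. */
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Decompress a stored entry when servicing a retrieval request. */
    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed entry", e);
        } finally {
            inflater.end(); // release native resources
        }
    }
}
```

Verbose, repetitive text such as the human readable request parameters tends to compress well, so the stored form is typically much smaller than the original.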

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses  [ , \{ SERVERS=> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer. Provide a table name as the first argument to find all of its regions 
and retrieve too slow response entries for that table from all servers 
currently hosting it. Provide a region name as the first argument to retrieve 
all too slow response entries for that region. Optionally provide a map of 
parameters as the second argument. The SERVERS parameter, which expects a list 
of server names, constrains the search to the given set of servers. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

{{get_slow_responses [  ... ,  ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers on the cluster 
if no argument is provided. A server name is its host, port, and start code, 
e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(String tableOrRegion, @Nullable List servers);
{code}
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}


[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan","param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. (All of these things have been reported from 
various production settings. Truncation was added after we crashed a user's log 
capture system.) The truncation is unfortunate because it eliminates much of 
the utility of the warnings. For example, the region name, the start and end 
keys, and the filter hierarchy are all important clues for debugging 
performance problems caused by moderate to low selectivity queries or queries 
made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade-off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses  [ , \{ SERVERS=> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer. Provide a table name as the first argument to find all of its regions 
and retrieve too slow response entries for that table from all servers 
currently hosting it. Provide a region name as the first argument to retrieve 
all too slow response entries for that region. Optionally provide a map of 
parameters as the second argument. The SERVERS parameter, which expects a list 
of server names, constrains the search to the given set of servers. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

{{get_slow_responses [  ... ,  ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers on the cluster 
if no argument is provided. A server name is its host, port, and start code, 
e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(String tableOrRegion, @Nullable List servers);
{code}
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}


[jira] [Updated] (HBASE-22978) Online slow response log

2019-09-05 Thread Andrew Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-22978:
---
Description: 
Today when an individual RPC exceeds a configurable time bound we log a 
complaint by way of the logging subsystem. These log lines look like:

{noformat}

2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan","param":"region { type: REGION_NAME value: 
\"tsdb,\\000\\000\\215\\f)o\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}

{noformat}

Unfortunately we often truncate the request parameters, like in the above 
example. We do this because the human readable representation is verbose, the 
rate of too slow warnings may be high, and the combination of these things can 
overwhelm the log capture system. (All of these things have been reported from 
various production settings. Truncation was added after we crashed a user's log 
capture system.) The truncation is unfortunate because it eliminates much of 
the utility of the warnings. For example, the region name, the start and end 
keys, and the filter hierarchy are all important clues for debugging 
performance problems caused by moderate to low selectivity queries or queries 
made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too 
slow in addition to the responseTooSlow logging. The in-memory representation 
can be complete and compressed. A new admin API and shell command can provide 
access to the ring buffer for online performance debugging. A modest sizing of 
the ring buffer will prevent excessive memory utilization for a minor 
performance debugging feature by limiting the total number of retained records. 
There is some chance a high rate of requests will cause information on other 
interesting requests to be overwritten before it can be read. This is the 
nature of a ring buffer and an acceptable trade-off.

The write request types do not require us to retain all information submitted 
in the request. We don't need to retain all key-values in the mutation, which 
may be too large to comfortably retain. We only need a unique set of row keys, 
or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to 
apply fast compression to ring buffer entries (if codec support is available), 
something like snappy or zstandard, and decompress on the fly when servicing 
the retrieval API request. This will minimize the impact of retaining more 
information about slow requests than we do today.

This proposal is for retention of request information only, the same 
information provided by responseTooSlow warnings. Total size of response 
serialization, possibly also total cell or row counts, should be sufficient to 
characterize the response.

—

New shell commands:

{{get_slow_responses  [ , \{ SERVERS=> } ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer. Provide a table name as the first argument to find all of its regions 
and retrieve too slow response entries for that table from all servers 
currently hosting it. Provide a region name as the first argument to retrieve 
all too slow response entries for that region. Optionally provide a map of 
parameters as the second argument. The SERVERS parameter, which expects a list 
of server names, constrains the search to the given set of servers. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

{{get_slow_responses [  ... ,  ]}}

Retrieve, decode, and pretty print the contents of the too slow response ring 
buffer maintained by the given list of servers; or all servers on the cluster 
if no argument is provided. A server name is its host, port, and start code, 
e.g. "host187.example.com,60020,1289493121758".

{{clear_slow_responses [  ... ,  ]}}

Clear the too slow response ring buffer maintained by the given list of 
servers; or all servers on the cluster if no argument is provided. A server 
name is its host, port, and start code, e.g. 
"host187.example.com,60020,1289493121758".

—

New Admin APIs:
{code:java}
List Admin#getSlowResponses(String tableOrRegion, @Nullable List servers);
{code}
{code:java}
List Admin#getSlowResponses(@Nullable List servers);
{code}
{code:java}
List Admin#clearSlowResponses(@Nullable List servers);
{code}
