[ https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042638#comment-17042638 ]
Viraj Jasani commented on HBASE-22978: -------------------------------------- Update on the progress: I have addressed all concerns in latest reviews, almost ready for commit to master branch and then will prepare branch-2 backport. Will create all follow up tasks soon. > Online slow response log > ------------------------ > > Key: HBASE-22978 > URL: https://issues.apache.org/jira/browse/HBASE-22978 > Project: HBase > Issue Type: New Feature > Components: Admin, Operability, regionserver, shell > Affects Versions: 3.0.0, 2.3.0, 1.5.1 > Reporter: Andrew Kyle Purtell > Assignee: Viraj Jasani > Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.7.0 > > Attachments: Screen Shot 2019-10-19 at 2.31.59 AM.png, Screen Shot > 2019-10-19 at 2.32.54 AM.png, Screen Shot 2019-10-19 at 2.34.11 AM.png, > Screen Shot 2019-10-19 at 2.36.14 AM.png > > > Today when an individual RPC exceeds a configurable time bound we log a > complaint by way of the logging subsystem. These log lines look like: > {noformat} > 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - > (responseTooSlow): > {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)", > "starttimems":1567203007549, > "responsesize":6819737, > "method":"Scan", > "param":"region { type: REGION_NAME value: > \"tsdb,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>", > "processingtimems":28646, > "client":"10.253.196.215:41116", > "queuetimems":22453, > "class":"HRegionServer"} > {noformat} > Unfortunately we often truncate the request parameters, like in the above > example. We do this because the human readable representation is verbose, the > rate of too slow warnings may be high, and the combination of these things > can overwhelm the log capture system. The truncation is unfortunate because > it eliminates much of the utility of the warnings. For example, the region > name, the start and end keys, and the filter hierarchy are all important > clues for debugging performance problems caused by moderate to low > selectivity queries or queries made at a high rate. > We can maintain an in-memory ring buffer of requests that were judged to be > too slow in addition to the responseTooSlow logging. The in-memory > representation can be complete and compressed. A new admin API and shell > command can provide access to the ring buffer for online performance > debugging. A modest sizing of the ring buffer will prevent excessive memory > utilization for a minor performance debugging feature by limiting the total > number of retained records. There is some chance a high rate of requests will > cause information on other interesting requests to be overwritten before it > can be read. This is the nature of a ring buffer and an acceptable trade off. > The write request types do not require us to retain all information submitted > in the request. We don't need to retain all key-values in the mutation, which > may be too large to comfortably retain. We only need a unique set of row > keys, or even a min/max range, and total counts. > The consumers of this information will be debugging tools. We can afford to > apply fast compression to ring buffer entries (if codec support is > available), something like snappy or zstandard, and decompress on the fly > when servicing the retrieval API request. This will minimize the impact of > retaining more information about slow requests than we do today. > This proposal is for retention of request information only, the same > information provided by responseTooSlow warnings. Total size of response > serialization, possibly also total cell or row counts, should be sufficient > to characterize the response. > Optionally persist new entries added to the ring buffer into one or more > files in HDFS in a write-behind manner. If the HDFS writer blocks or falls > behind and we are unable to persist an entry before it is overwritten, that > is fine. Response too slow logging is best effort. If we can detect this make > a note of it in the log file. Provide a tool for parsing, dumping, filtering, > and pretty printing the slow logs written to HDFS. The tool and the shell can > share and reuse some utility classes and methods for accomplishing that. > — > New shell commands: > {{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filter-attributes> > } ]}} > Retrieve, decode, and pretty print the contents of the too slow response ring > buffer maintained by the given list of servers; or all servers in the cluster > if no list is provided. Optionally provide a map of parameters for filtering > as additional argument. The TABLE filter, which expects a string containing a > table name, will include only entries pertaining to that table. The REGION > filter, which expects a string containing an encoded region name, will > include only entries pertaining to that region. The CLIENT_IP filter, which > expects a string containing an IP address, will include only entries > pertaining to that client. The USER filter, which expects a string containing > a user name, will include only entries pertaining to that user. Filters are > additive, for example if both CLIENT_IP and USER filters are given, entries > matching either or both conditions will be included. The exception to this is > if both TABLE and REGION filters are provided, REGION will be preferred and > TABLE will be ignored. A server name is its host, port, and start code, e.g. > "host187.example.com,60020,1289493121758". > {{clear_slow_responses [ <server1> ... , <serverN> ]}} > Clear the too slow response ring buffer maintained by the given list of > servers; or all servers on the cluster if no argument is provided. A server > name is its host, port, and start code, e.g. > "host187.example.com,60020,1289493121758". > — > New Admin APIs: > {code:java} > List<ResponseDetail> Admin#getSlowResponses(@Nullable List<String> servers); > {code} > {code:java} > List<ResponseDetail> Admin#clearSlowResponses(@Nullable List<String> servers); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)