[ https://issues.apache.org/jira/browse/HBASE-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049913#comment-16049913 ]
Josh Elser commented on HBASE-18023: ------------------------------------ How about {{hbase.rpc.rows.warning.threshold}} instead of {hbase.batch.size.threshold}}. Including some form of "warn" is good and I was trying to avoid the confusion (at sight) between size being "number of rows" and "number of bytes". {code} + int threshold = conf.getInt(HConstants.BATCH_SIZE_THRESHOLD_NAME, HConstants.BATCH_SIZE_THRESHOLD_DEFAULT); {code} We should cache this to save on frequently accessing the value from the configuration. We can do it when HRegion is constructed (or RSRpcServices is constructed, based on more comments to come..). {code} + LOG.warn("Large batch operation detected (greater than "+threshold+") (See https://issues.apache.org/jira/browse/HBASE-18023)." {code} Personally, I don't think the JIRA URL is going to be of much value to an admin. I would strike that part. {code} + + " Length: "+batchOp.operations.length {code} How about "Requested number of rows:" instead? {code} @@ -3200,6 +3209,7 @@ public class HRegion implements HeapSize, PropagatingConfigurationObserver, Regi List<RowLock> acquiredRowLocks = Lists.newArrayListWithCapacity(batchOp.operations.length); MemstoreSize memstoreSize = new MemstoreSize(); final ObservedExceptionsInBatch observedExceptions = new ObservedExceptionsInBatch(); + checkBatchSizeAndLogLargeSize(batchOp); {code} I'm a little worried about the case where one RPC contains a updates for a number of regions. Your current case properly handles the case where there are more than 1000 updates to a single region. However, multi RPCs can contain actions for many regions in one RPC. So, if I crafted a multi RPC in which I included 1000 Region action objects each of which only have 999 operations (so, 999 operations for 1000 different regions for a total of 999000 actions), the current logging wouldn't catch anything. What about lifting this method out of HRegion and into {{RSRpcServices#multi}} instead? This way, we can invoke your method on the total RPC request object sent in, not a portion of it. Does this make sense? Nice test addition too by the way. We might be able to make your test a little more stable by avoiding writing a test expecting a specific message is logged. Instead, what if you moved your {{LOG.warn}} to its own method, mocked out the invocation of that method and then verified that the method was called (just a different way to show that the log statement would be printed in a real system). > Log multi-* requests for more than threshold number of rows > ----------------------------------------------------------- > > Key: HBASE-18023 > URL: https://issues.apache.org/jira/browse/HBASE-18023 > Project: HBase > Issue Type: Improvement > Components: regionserver > Reporter: Clay B. > Assignee: David Harju > Priority: Minor > Attachments: HBASE-18023.master.001.patch > > > Today, if a user happens to do something like a large multi-put, they can get > through request throttling (e.g. it is one request) but still crash a region > server with a garbage storm. We have seen regionservers hit this issue and it > is silent and deadly. The RS will report nothing more than a mysterious > garbage collection and exit out. > Ideally, we could report a large multi-* request before starting it, in case > it happens to be deadly. Knowing the client, user and how many rows are > affected would be a good start to tracking down painful users. -- This message was sent by Atlassian JIRA (v6.4.14#64029)