[ 
https://issues.apache.org/jira/browse/HBASE-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049913#comment-16049913
 ] 

Josh Elser commented on HBASE-18023:
------------------------------------

How about {{hbase.rpc.rows.warning.threshold}} instead of 
{hbase.batch.size.threshold}}. Including some form of "warn" is good and I was 
trying to avoid the confusion (at sight) between size being "number of rows" 
and "number of bytes".

{code}
+      int threshold = conf.getInt(HConstants.BATCH_SIZE_THRESHOLD_NAME, 
HConstants.BATCH_SIZE_THRESHOLD_DEFAULT);
{code}

We should cache this to save on frequently accessing the value from the 
configuration. We can do it when HRegion is constructed (or RSRpcServices is 
constructed, based on more comments to come..).

{code}
+          LOG.warn("Large batch operation detected (greater than 
"+threshold+") (See https://issues.apache.org/jira/browse/HBASE-18023)."
{code}

Personally, I don't think the JIRA URL is going to be of much value to an 
admin. I would strike that part.

{code}
+                  + " Length: "+batchOp.operations.length
{code}

How about "Requested number of rows:" instead?

{code}
@@ -3200,6 +3209,7 @@ public class HRegion implements HeapSize, 
PropagatingConfigurationObserver, Regi
     List<RowLock> acquiredRowLocks = 
Lists.newArrayListWithCapacity(batchOp.operations.length);
     MemstoreSize memstoreSize = new MemstoreSize();
     final ObservedExceptionsInBatch observedExceptions = new 
ObservedExceptionsInBatch();
+    checkBatchSizeAndLogLargeSize(batchOp);
{code}

I'm a little worried about the case where one RPC contains a updates for a 
number of regions. Your current case properly handles the case where there are 
more than 1000 updates to a single region.

However, multi RPCs can contain actions for many regions in one RPC. So, if I 
crafted a multi RPC in which I included 1000 Region action objects each of 
which only have 999 operations (so, 999 operations for 1000 different regions 
for a total of 999000 actions), the current logging wouldn't catch anything. 
What about lifting this method out of HRegion and into {{RSRpcServices#multi}} 
instead? This way, we can invoke your method on the total RPC request object 
sent in, not a portion of it. Does this make sense?

Nice test addition too by the way. We might be able to make your test a little 
more stable by avoiding writing a test expecting a specific message is logged. 
Instead, what if you moved your {{LOG.warn}} to its own method, mocked out the 
invocation of that method and then verified that the method was called (just a 
different way to show that the log statement would be printed in a real system).

> Log multi-* requests for more than threshold number of rows
> -----------------------------------------------------------
>
>                 Key: HBASE-18023
>                 URL: https://issues.apache.org/jira/browse/HBASE-18023
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Clay B.
>            Assignee: David Harju
>            Priority: Minor
>         Attachments: HBASE-18023.master.001.patch
>
>
> Today, if a user happens to do something like a large multi-put, they can get 
> through request throttling (e.g. it is one request) but still crash a region 
> server with a garbage storm. We have seen regionservers hit this issue and it 
> is silent and deadly. The RS will report nothing more than a mysterious 
> garbage collection and exit out.
> Ideally, we could report a large multi-* request before starting it, in case 
> it happens to be deadly. Knowing the client, user and how many rows are 
> affected would be a good start to tracking down painful users.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to