Looks like https://issues.apache.org/jira/browse/HBASE-19215 to me. Passing -Djdk.nio.maxCachedBufferSize=262144 to the client JVM might prevent this from happening.
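
As far as I understand that failure mode, java.nio keeps a per-thread cache of temporary direct buffers sized to the largest writes a thread has issued, so many htable-pool threads pushing large multi-Put batches can pin enough direct memory to run the writer JVM into its -XX:MaxDirectMemorySize limit; the property just caps the size of buffers that cache is allowed to retain, so oversized ones are freed after use instead of accumulating.

Since your writers appear to be standalone JVMs (the AbstractHBaseWriter process in your client log), the flag would go on whatever command launches them, roughly like this (the memory sizes, classpath and main class below are placeholders, not recommendations):

    # illustrative launch command only; the sizes and class name are made up
    java -Xmx4g -XX:MaxDirectMemorySize=1g \
         -Djdk.nio.maxCachedBufferSize=262144 \
         -cp <your-writer-classpath> com.yourcompany.HBaseWriterMain

If any clients are started through the hbase script instead, adding the same -D option to HBASE_OPTS in hbase-env.sh on those hosts should have the same effect.
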
On Fri, 11 Jan 2019 at 22:04, Buchi Reddy Busi Reddy <[email protected]> wrote:
>
> Hi,
>
> In our production, we have been seeing this strange issue where a few
> RegionServers see the below exceptions, then their memory usage and GC
> time increase, and eventually the JvmPauseMonitor task kills the
> RegionServer because of high GC pauses.
>
> 2018-12-19 08:54:25,474 ERROR [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.NullPointerException
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2143)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
> 2018-12-19 08:54:25,475 WARN [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020: caught a ClosedChannelException, this means that the server /10.130.157.159:60020 was processing a request but the client went away. The error message was: null
> 2018-12-19 08:54:25,662 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
> 2018-12-19 08:54:25,685 ERROR [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.NegativeArraySizeException
> at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
> at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
>
> 2018-12-19 08:55:27,482 ERROR [RpcServer.FifoWFPBQ.default.handler=14,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.NegativeArraySizeException
> at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
> at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
>
> 2018-12-19 09:06:46,527 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6491ms
> GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=6972ms
> 2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
> 2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
> 2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
> 2018-12-19 09:07:04,208 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914
> java.io.EOFException: Premature EOF: no length prefix available
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
> 2018-12-19 09:07:04,208 WARN [regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1] wal.MetricsWAL: regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1 took 41693 ms appending an edit to wal; len~=439.1 K
> 2018-12-19 09:07:09,916 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913
> java.io.EOFException: Premature EOF: no length prefix available
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
> 2018-12-19 09:07:41,667 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 17182ms
> GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=17667ms
> 2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_4] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
> 2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
> 2018-12-19 09:07:49,701 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_3] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
> 2018-12-19 09:10:36,457 WARN [main] util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
> 2018-12-19 09:10:38,993 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-235--1, built on 10/30/2017 02:17 GMT
> 2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=ip-10-130-157-159.us-west-2.compute.internal
> 2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_151
> 2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
>
> 2018-12-15 15:02:12,875 WARN [RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020: caught a ClosedChannelException, this means that the server /10.130.146.146:60020 was processing a request but the client went away. The error message was: null
> 2018-12-15 15:02:13,824 ERROR [RpcServer.FifoWFPBQ.default.handler=18,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: offset (369668079) + length (8) exceed the capacity of the array: 1640976
> at org.apache.hadoop.hbase.util.Bytes.explainWrongLengthOrOffset(Bytes.java:631)
> at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:605)
> at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:590)
> at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1241)
> at org.apache.hadoop.hbase.KeyValue.toString(KeyValue.java:1181)
> at org.apache.hadoop.hbase.client.Put.add(Put.java:321)
> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
>
> We initially suspected that some packet corruption must be happening over
> the network, but that doesn't seem to be the case. However, we have found
> that a few of the HBase clients which are doing Put operations (in
> batches) are running into OutOfMemory issues due to Direct buffer memory
> and were failing to write to HBase.
>
> Here are the logs from one of the HBase clients which had OutOfMemory errors.
>
> 2019-01-10 07:58:16.892 [HBase WRITE-3] INFO o.a.hadoop.hbase.client.AsyncProcess - Left over 9 task(s) are processed on server(s): [ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, ip-10-130-158-137.us-west-2.compute.internal,60020,1545261163484, ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160]
> 2019-01-10 07:58:27.452 [HBase WRITE-2] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
> 2019-01-10 07:58:27.503 [HBase WRITE-4] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
> 2019-01-10 07:58:29.535 [htable-pool3231971-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=239ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 239 - final failure
> 2019-01-10 07:58:29.540 [htable-pool3231971-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2567ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2567 - final failure
> 2019-01-10 07:58:29.542 [htable-pool3231971-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2694ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2694 - final failure
> 2019-01-10 07:58:29.542 [htable-pool3231971-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2820ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2820 - final failure
> 2019-01-10 07:58:29.550 [htable-pool3231972-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1620ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1620 - final failure
> 2019-01-10 07:58:29.551 [htable-pool3231971-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=907ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 907 - final failure
> 2019-01-10 07:58:29.551 [htable-pool3231971-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2657ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2657 - final failure
> 2019-01-10 07:58:29.561 [htable-pool3231972-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1637ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1637 - final failure
> 2019-01-10 07:58:29.562 [htable-pool3231972-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=140ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 140 - final failure
> 2019-01-10 07:58:29.563 [htable-pool3231972-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2769ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2769 - final failure
> 2019-01-10 07:58:29.564 [htable-pool3231972-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1623ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1623 - final failure
> 2019-01-10 07:58:29.565 [htable-pool3231972-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=625ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 625 - final failure
> 2019-01-10 07:58:29.567 [htable-pool3231972-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1721ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1721 - final failure
> 2019-01-10 07:58:29.574 [htable-pool3231971-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=4782ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 4782 - final failure
> 2019-01-10 07:58:29.629 [HBase WRITE-4] ERROR c.a.b.p.c.h.w.AbstractHBaseWriter - Failed 10135 actions: IOException: 10135 times,
>
> The errors on the RegionServer seem to happen right around the time the
> clients are seeing the OutOfMemory errors, and from the timestamps it's
> not super clear to determine cause vs. effect.
>
> *Question:*
>
> Is it actually possible for the HBase client to send malformed/partial
> RPC messages to the RS if its JVM is suffering OutOfMemory issues due to
> insufficient direct memory? It would be helpful to know the root cause
> here so that we can remediate the issue accordingly.
>
> These
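
On the cause-vs-effect question: rather than trying to read it off the timestamps, it may be simpler to watch the writers' direct buffer pool and see whether it climbs toward -XX:MaxDirectMemorySize before the ServiceExceptions start. The JDK exposes that pool through BufferPoolMXBean; below is a minimal sketch of a monitor you could run inside (or alongside) a writer, where the class name and the 10-second interval are just placeholders:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    // Hypothetical helper: periodically logs this JVM's direct-buffer usage.
    public class DirectBufferWatcher implements Runnable {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          for (BufferPoolMXBean pool :
              ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
              System.out.printf("direct buffers: count=%d used=%d MB capacity=%d MB%n",
                  pool.getCount(),
                  pool.getMemoryUsed() / (1024 * 1024),
                  pool.getTotalCapacity() / (1024 * 1024));
            }
          }
          try {
            Thread.sleep(10_000L); // sample every 10 seconds; tune as needed
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }

      public static void main(String[] args) {
        new Thread(new DirectBufferWatcher(), "direct-buffer-watcher").start();
      }
    }

If the reported capacity keeps ratcheting up with your batch sizes and never comes back down, that is consistent with the cached-buffer growth that the jdk.nio.maxCachedBufferSize cap is meant to address.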
