Looks like https://issues.apache.org/jira/browse/HBASE-19215 to me. Passing -Djdk.nio.maxCachedBufferSize=262144 to the client JVM might prevent this from happening.
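
As far as I understand that failure mode, java.nio keeps a per-thread cache of temporary direct buffers sized to the largest writes a thread has issued, so many htable-pool threads pushing large multi-Put batches can pin enough direct memory to run the writer JVM into its -XX:MaxDirectMemorySize limit; the property just caps the size of buffers that cache is allowed to retain, so oversized ones are freed after use instead of accumulating.

Since your writers appear to be standalone JVMs (the AbstractHBaseWriter process in your client log), the flag would go on whatever command launches them, roughly like this (the memory sizes, classpath and main class below are placeholders, not recommendations):

    # illustrative launch command only; the sizes and class name are made up
    java -Xmx4g -XX:MaxDirectMemorySize=1g \
         -Djdk.nio.maxCachedBufferSize=262144 \
         -cp <your-writer-classpath> com.yourcompany.HBaseWriterMain

If any clients are started through the hbase script instead, adding the same -D option to HBASE_OPTS in hbase-env.sh on those hosts should have the same effect.
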
On Fri, 11 Jan 2019 at 22:04, Buchi Reddy Busi Reddy <[email protected]> wrote:
>
> Hi,
>
> In our production, we have been seeing this strange issue where a few
> RegionServers see the below exceptions, then their memory usage and GC
> time increase, and eventually the JvmPauseMonitor task kills the
> RegionServer because of high GC pauses.
>
> 2018-12-19 08:54:25,474 ERROR [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.NullPointerException
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2143)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
> 2018-12-19 08:54:25,475 WARN [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020: caught a ClosedChannelException, this means that the server /10.130.157.159:60020 was processing a request but the client went away. The error message was: null
> 2018-12-19 08:54:25,662 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
> 2018-12-19 08:54:25,685 ERROR [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.NegativeArraySizeException
> at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
> at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
>
> 2018-12-19 08:55:27,482 ERROR [RpcServer.FifoWFPBQ.default.handler=14,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.NegativeArraySizeException
> at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
> at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
>
> 2018-12-19 09:06:46,527 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6491ms
> GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=6972ms
> 2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
> 2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
> 2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
> 2018-12-19 09:07:04,208 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914
> java.io.EOFException: Premature EOF: no length prefix available
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
> 2018-12-19 09:07:04,208 WARN [regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1] wal.MetricsWAL: regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1 took 41693 ms appending an edit to wal; len~=439.1 K
> 2018-12-19 09:07:09,916 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913
> java.io.EOFException: Premature EOF: no length prefix available
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
> 2018-12-19 09:07:41,667 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 17182ms
> GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=17667ms
> 2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_4] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
> 2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
> 2018-12-19 09:07:49,701 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_3] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
> 2018-12-19 09:10:36,457 WARN [main] util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
> 2018-12-19 09:10:38,993 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-235--1, built on 10/30/2017 02:17 GMT
> 2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=ip-10-130-157-159.us-west-2.compute.internal
> 2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_151
> 2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
>
> 2018-12-15 15:02:12,875 WARN [RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020: caught a ClosedChannelException, this means that the server /10.130.146.146:60020 was processing a request but the client went away. The error message was: null
> 2018-12-15 15:02:13,824 ERROR [RpcServer.FifoWFPBQ.default.handler=18,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object
> java.lang.IllegalArgumentException: offset (369668079) + length (8) exceed the capacity of the array: 1640976
> at org.apache.hadoop.hbase.util.Bytes.explainWrongLengthOrOffset(Bytes.java:631)
> at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:605)
> at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:590)
> at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1241)
> at org.apache.hadoop.hbase.KeyValue.toString(KeyValue.java:1181)
> at org.apache.hadoop.hbase.client.Put.add(Put.java:321)
> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
>
> We initially suspected that some packet corruption must be happening over
> the network, but that doesn't seem to be the case. However, we have found
> that a few of the HBase clients which are doing Put operations (in
> batches) are running into OutOfMemory issues due to Direct buffer memory
> and were failing to write to HBase.
>
> Here are the logs from one of the HBase clients which had OutOfMemory errors.
>
> 2019-01-10 07:58:16.892 [HBase WRITE-3] INFO o.a.hadoop.hbase.client.AsyncProcess - Left over 9 task(s) are processed on server(s): [ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, ip-10-130-158-137.us-west-2.compute.internal,60020,1545261163484, ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160]
> 2019-01-10 07:58:27.452 [HBase WRITE-2] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
> 2019-01-10 07:58:27.503 [HBase WRITE-4] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
> 2019-01-10 07:58:29.535 [htable-pool3231971-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=239ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 239 - final failure
> 2019-01-10 07:58:29.540 [htable-pool3231971-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2567ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2567 - final failure
> 2019-01-10 07:58:29.542 [htable-pool3231971-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2694ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2694 - final failure
> 2019-01-10 07:58:29.542 [htable-pool3231971-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2820ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2820 - final failure
> 2019-01-10 07:58:29.550 [htable-pool3231972-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1620ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1620 - final failure
> 2019-01-10 07:58:29.551 [htable-pool3231971-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=907ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 907 - final failure
> 2019-01-10 07:58:29.551 [htable-pool3231971-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2657ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2657 - final failure
> 2019-01-10 07:58:29.561 [htable-pool3231972-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1637ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1637 - final failure
> 2019-01-10 07:58:29.562 [htable-pool3231972-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=140ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 140 - final failure
> 2019-01-10 07:58:29.563 [htable-pool3231972-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2769ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2769 - final failure
> 2019-01-10 07:58:29.564 [htable-pool3231972-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1623ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1623 - final failure
> 2019-01-10 07:58:29.565 [htable-pool3231972-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=625ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 625 - final failure
> 2019-01-10 07:58:29.567 [htable-pool3231972-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1721ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1721 - final failure
> 2019-01-10 07:58:29.574 [htable-pool3231971-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=4782ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 4782 - final failure
> 2019-01-10 07:58:29.629 [HBase WRITE-4] ERROR c.a.b.p.c.h.w.AbstractHBaseWriter - Failed 10135 actions: IOException: 10135 times,
>
> The errors on the RegionServer seem to happen right around the time the
> clients are seeing the OutOfMemory errors, and from the timestamps it's
> not super clear to determine cause vs. effect.
>
> *Question:*
>
> Is it actually possible for the HBase client to send malformed/partial
> RPC messages to the RS if its JVM is suffering OutOfMemory issues due to
> insufficient direct memory? It would be helpful to know the root cause
> here so that we can remediate the issue accordingly.
>
> These
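
On the cause-vs-effect question: rather than trying to read it off the timestamps, it may be simpler to watch the writers' direct buffer pool and see whether it climbs toward -XX:MaxDirectMemorySize before the ServiceExceptions start. The JDK exposes that pool through BufferPoolMXBean; below is a minimal sketch of a monitor you could run inside (or alongside) a writer, where the class name and the 10-second interval are just placeholders:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    // Hypothetical helper: periodically logs this JVM's direct-buffer usage.
    public class DirectBufferWatcher implements Runnable {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          for (BufferPoolMXBean pool :
              ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
              System.out.printf("direct buffers: count=%d used=%d MB capacity=%d MB%n",
                  pool.getCount(),
                  pool.getMemoryUsed() / (1024 * 1024),
                  pool.getTotalCapacity() / (1024 * 1024));
            }
          }
          try {
            Thread.sleep(10_000L); // sample every 10 seconds; tune as needed
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }

      public static void main(String[] args) {
        new Thread(new DirectBufferWatcher(), "direct-buffer-watcher").start();
      }
    }

If the reported capacity keeps ratcheting up with your batch sizes and never comes back down, that is consistent with the cached-buffer growth that the jdk.nio.maxCachedBufferSize cap is meant to address.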
