[jira] [Created] (ACCUMULO-3575) Accumulo GC ran out of memory

Keith Turner (JIRA) Tue, 10 Feb 2015 13:07:13 -0800

Keith Turner created ACCUMULO-3575:
--------------------------------------

             Summary: Accumulo GC ran out of memory
                 Key: ACCUMULO-3575
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3575
             Project: Accumulo
          Issue Type: Bug
    Affects Versions: 1.6.0
            Reporter: Keith Turner
            Priority: Minor



During a CI run (with agitation) on a 20-node EC2 cluster, the Accumulo GC died 
with the following errors.

The following was in the GC .out file:

{noformat}
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 20970"...
{noformat}

The following were the last lines of the .log file:

{noformat}
2015-02-10 20:19:03,255 [gc.SimpleGarbageCollector] INFO : Collect cycle took 
13.07 seconds
2015-02-10 20:19:03,258 [gc.SimpleGarbageCollector] INFO : Beginning garbage 
collection of write-ahead logs
2015-02-10 20:19:03,265 [zookeeper.ZooUtil] DEBUG: Trying to read instance id 
from hdfs://ip-10-1-2-11:9000/accumulo/instance_id
{noformat}

I restarted the GC and the same thing happened.  Looking in the walog directory, 
I found about 333k walogs.  This is the problem: the GC tries to read the entire 
list of files into memory.

{noformat}
$ hadoop fs -ls -R /accumulo/wal | wc
15/02/10 20:31:35 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
 333053 2664424 43629314
{noformat}
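
To see how many of those entries are empty files rather than files with data, the same listing can be piped through a small filter. This is a minimal sketch, not something from the original report: the function name `summarize_walogs` is hypothetical, and it assumes the usual 8-column `hadoop fs -ls -R` output (permissions, replication, owner, group, size, date, time, path).

```shell
# Summarize `hadoop fs -ls -R` output read from stdin.
# Assumes the usual 8-column listing: perms, replication, owner, group,
# size, date, time, path. Only regular-file lines (perms starting with "-")
# are counted, so directory lines and stray log lines are ignored.
summarize_walogs() {
  awk '
    $1 ~ /^-/ && NF >= 8 {
      if ($5 == 0) empty++; else nonempty++
    }
    END { printf "empty=%d nonempty=%d\n", empty, nonempty }
  '
}

# Usage on the cluster (path taken from the report):
#   hadoop fs -ls -R /accumulo/wal 2>/dev/null | summarize_walogs
```

Counting in a streaming filter like this keeps memory use flat no matter how many files the directory holds.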

I suspect there were so many walogs because there were many failures like the 
following, which resulted in zero-length walogs (only 199 of the 333K have 
non-zero length).  The following error is from a tserver and is probably a 
result of killing datanodes.

{noformat}
2015-02-10 03:45:00,447 [log.TabletServerLogger] ERROR: Unexpected error 
writing to log, retrying attempt 122
java.lang.RuntimeException: 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/accumulo/wal/ip-10-1-2-21+9997/9906de55-bc93-47f4-887c-4b9540fc3528 could only 
be replicated to 0 nodes instead of minReplication (=1).  There 
are 16 datanode(s) running and no node(s) are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

        at 
org.apache.accumulo.tserver.log.TabletServerLogger.createLoggers(TabletServerLogger.java:190)
        at 
org.apache.accumulo.tserver.log.TabletServerLogger.access$300(TabletServerLogger.java:53)
        at 
org.apache.accumulo.tserver.log.TabletServerLogger$1.withWriteLock(TabletServerLogger.java:148)
        at 
org.apache.accumulo.tserver.log.TabletServerLogger.testLockAndRun(TabletServerLogger.java:115)
        at 
org.apache.accumulo.tserver.log.TabletServerLogger.initializeLoggers(TabletServerLogger.java:137)
        at 
org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:245)
        at 
org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:230)
        at 
org.apache.accumulo.tserver.log.TabletServerLogger.log(TabletServerLogger.java:345)
        at 
org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.update(TabletServer.java:1817)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.apache.accumulo.trace.instrument.thrift.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46)
        at 
org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:47)
        at com.sun.proxy.$Proxy22.update(Unknown Source)
        at 
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2394)
        at 
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2378)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at 
org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168)
        at 
org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
        at 
org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at 
org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at 
org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/accumulo/wal/ip-10-1-2-21+9997/9906de55-bc93-47f4-887c-4b9540fc3528 could only 
be replicated to 0 nodes instead of minReplication (=1).  There are 16 
datanode(s) running and no node(s) are excluded in this operation.
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

        at org.apache.hadoop.ipc.Client.call(Client.java:1468)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy21.addBlock(Unknown Source)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
{noformat}


Raised the GC max memory from 256k to 2G, and it then ran OK.
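
The heap change described above might look like the following in accumulo-env.sh, assuming the stock Accumulo 1.6 layout in which the GC process takes its JVM flags from `ACCUMULO_GC_OPTS`. Only the 2G limit comes from this report; the file location, the use of this variable on this cluster, and the `-Xms` value are assumptions.

```shell
# conf/accumulo-env.sh (assumed location; adjust to your install)
# Raise the GC process's max heap to 2G, the value reported to work here.
# -Xms256m is an illustrative starting heap, not taken from the report.
export ACCUMULO_GC_OPTS="-Xmx2g -Xms256m"
```

The GC process would need a restart to pick up the new setting.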



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
