[ https://issues.apache.org/jira/browse/HBASE-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594088#comment-13594088 ]
stack commented on HBASE-7982:
------------------------------

The OOME happened in the recent https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.95/26/console

> TestReplicationQueueFailover* runs for a minute, spews 3/4 million lines complaining 'Filesystem closed', has an NPE, and still passes?
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7982
>                 URL: https://issues.apache.org/jira/browse/HBASE-7982
>             Project: HBase
>          Issue Type: Bug
>          Components: build
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Blocker
>         Attachments: hbase-7982-combined.patch, hbase-7982-huge-logging.patch, hbase-7982-NPE_2.patch, hbase-7982-NPE.patch
>
>
> I was trying to look at why Hudson occasionally OOMEs trying to make a report on 0.95 build #4 https://builds.apache.org/job/hbase-0.95/4/console:
> {code}
> ERROR: Failed to archive test reports
> hudson.util.IOException2: remote file operation failed: /home/jenkins/jenkins-slave/workspace/hbase-0.95 at hudson.remoting.Channel@151a4e3e:ubuntu3
>         at hudson.FilePath.act(FilePath.java:861)
>         at hudson.FilePath.act(FilePath.java:838)
>         at hudson.tasks.junit.JUnitParser.parse(JUnitParser.java:87)
>         at ...
> Caused by: java.lang.OutOfMemoryError: Java heap space
>         at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
>         at java.nio.CharBuffer.allocate(CharBuffer.java:329)
>         at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
>         at java.nio.charset.Charset.decode(Charset.java:791)
>         at hudson.tasks.junit.SuiteResult.<init>(SuiteResult.java:215)
>         ...
> {code}
> We are trying to allocate a big buffer and failing.
> Looking at the reports being generated, we have quite a few that are > 10MB in size:
> {code}
> durruti:0.95 stack$ find hbase-* -type f -size +10000k -exec ls -la {} \;
> -rw-r--r--@ 1 stack staff  11126492 Feb 27 06:14 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.backup.TestHFileArchiving-output.txt
> -rw-r--r--@ 1 stack staff  13296009 Feb 27 05:47 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestFromClientSide3-output.txt
> -rw-r--r--@ 1 stack staff  10541898 Feb 27 05:47 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestMultiParallel-output.txt
> -rw-r--r--@ 1 stack staff  25344601 Feb 27 05:51 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient-output.txt
> -rw-r--r--@ 1 stack staff  17966969 Feb 27 06:12 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction-output.txt
> -rw-r--r--@ 1 stack staff  17699068 Feb 27 06:09 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit-output.txt
> -rw-r--r--@ 1 stack staff  17701832 Feb 27 06:07 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed-output.txt
> -rw-r--r--@ 1 stack staff 717853709 Feb 27 06:17 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailover-output.txt
> -rw-r--r--@ 1 stack staff 563616793 Feb 27 06:17 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed-output.txt
> {code}
> ... with TestReplicationQueueFailover* being an order of magnitude bigger than the others.
> Looking in the tests, I see both spewing between 800 and 900 thousand lines in about a minute.
> Here is their fixation:
> {code}
> 2013-02-27 06:17:48,176 ERROR [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] wal.FSHLog$LogSyncer(1012): Error while syncing, requesting close of hlog.
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
>         at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
>         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
>         at java.lang.Thread.run(Thread.java:722)
> 2013-02-27 06:17:48,176 FATAL [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] wal.FSHLog(1140): Could not sync. Requesting close of hlog
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
>         at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
>         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
>         at java.lang.Thread.run(Thread.java:722)
> ...
> {code}
> These tests are 'succeeding'?
> I also see in both:
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.seek(SequenceFileLogReader.java:261)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.seek(ReplicationHLogReaderManager.java:103)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:415)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:333)
> {code}
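For context on why the spew is avoidable: once DFSClient.checkOpen throws "Filesystem closed", that client never recovers, so a syncer thread that keeps retrying only repeats the same stack trace for as long as the test keeps running. Below is a minimal, hypothetical sketch of a sync loop that logs the condition once and stops; it is not any of the attached patches, and the names SyncerSketch and Syncable are invented for illustration.

{code}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class SyncerSketch implements Runnable {

  /** Stand-in for the FSDataOutputStream-like target the real syncer flushes. */
  public interface Syncable {
    void sync() throws IOException;
  }

  private final AtomicBoolean running = new AtomicBoolean(true);
  private final Syncable sink;
  private final long intervalMillis;

  public SyncerSketch(Syncable sink, long intervalMillis) {
    this.sink = sink;
    this.intervalMillis = intervalMillis;
  }

  @Override
  public void run() {
    while (running.get()) {
      try {
        sink.sync();
      } catch (IOException e) {
        String msg = e.getMessage();
        if (msg != null && msg.contains("Filesystem closed")) {
          // The DFS client will never come back once it reports this, so retrying
          // only repeats the same stack trace millions of times in the test log.
          System.err.println("Filesystem closed; stopping syncer: " + e);
          break;
        }
        System.err.println("Transient sync failure, will retry: " + e);
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
    }
  }

  /** Asks the syncer loop to exit on its next iteration. */
  public void stop() {
    running.set(false);
  }
}
{code}

With a guard of this shape, a shutdown race would still surface one ERROR entry rather than hundreds of megabytes of repeated traces in the surefire output file.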
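Similarly, the NullPointerException quoted above suggests the replication source is seeking on a reader that was nulled out or never opened. The following is a tiny, hypothetical illustration of guarding such a seek so the failure is explicit; it is not the hbase-7982-NPE patch, and ReaderHandleSketch and its fields are made-up names.

{code}
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Illustrative only: refuse to seek when the underlying reader is absent,
 * surfacing a descriptive IOException instead of a NullPointerException.
 */
public class ReaderHandleSketch {

  private RandomAccessFile reader;   // stands in for the wrapped log reader
  private long lastEntryPosition;    // position remembered from the previous read

  public void open(String path) throws IOException {
    this.reader = new RandomAccessFile(path, "r");
  }

  public void close() throws IOException {
    if (reader != null) {
      reader.close();
      reader = null;
    }
  }

  public void seekToLastEntry() throws IOException {
    if (reader == null) {
      // Make the failure mode explicit so the caller can decide to reopen or
      // move on to the next file, rather than dying on an NPE.
      throw new IOException("Log reader is not open; cannot seek to " + lastEntryPosition);
    }
    reader.seek(lastEntryPosition);
  }
}
{code}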