[jira] [Commented] (HBASE-7982) TestReplicationQueueFailover* runs for a minute, spews 3/4million lines complaining 'Filesystem closed', has an NPE, and still passes?

Jeffrey Zhong (JIRA) Sun, 03 Mar 2013 14:03:13 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591881#comment-13591881
 ]


Jeffrey Zhong commented on HBASE-7982:
--------------------------------------

I checked the error log (which is no longer available) and found that the test 
case failed because "utility1.loadTable" failed. It's the same error we got 
from job 6:
https://builds.apache.org/job/hbase-0.95/6/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/

The NPE is an old issue that was uncovered by my recent check-in, which sets 
the internal reader to null after closing it inside 
repLogReader.closeReader(). The root cause is that we don't reset this.reader 
when we close repLogReader. Since the NPE is triggered when there is "nothing 
to replicate", the test case passes even with the exception. 

Both variables point at the same object. We should not have the this.reader 
variable in the first place; dropping it would save the effort of keeping the 
two in sync.
{code}
      } finally {
        try {
          this.repLogReader.closeReader();
        } catch (IOException e) {
          gotIOE = true;
          LOG.warn("Unable to finalize the tailing of a file", e);
        }
      }
{code}
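
To make the stale-reference problem concrete, here is a minimal, self-contained 
sketch. It is not the HBase code: Reader, ReaderManager, BuggySource and 
FixedSource are hypothetical stand-ins for the reader, repLogReader and 
ReplicationSource, and the real control flow is more involved. It only shows 
how a second copy of the reference goes stale once the manager clears its own.

```java
public class StaleReaderSketch {

  /** Hypothetical stand-in for the underlying log reader. */
  static class Reader {
    long position;
    void seek(long pos) { position = pos; }
  }

  /** Hypothetical stand-in for repLogReader: owns the real reader reference. */
  static class ReaderManager {
    private Reader reader;

    Reader openReader() {
      if (reader == null) {
        reader = new Reader();
      }
      return reader;
    }

    boolean isOpen() { return reader != null; }

    // Throws NullPointerException when no reader is currently open.
    void seek(long pos) { reader.seek(pos); }

    // Mirrors the behaviour described above: the internal reader is nulled on close.
    void closeReader() { reader = null; }
  }

  /** Keeps its own copy of the reference and forgets to reset it on close. */
  static class BuggySource {
    final ReaderManager manager = new ReaderManager();
    private Reader reader; // second reference that must be kept in sync by hand

    void readOnce() {
      if (reader == null) {
        reader = manager.openReader();
      }
      try {
        manager.seek(0L);
      } finally {
        manager.closeReader(); // this.reader is NOT reset here, so it goes stale
      }
    }
  }

  /** No duplicate field: the manager is the single source of truth. */
  static class FixedSource {
    final ReaderManager manager = new ReaderManager();

    void readOnce() {
      if (!manager.isOpen()) {
        manager.openReader();
      }
      try {
        manager.seek(0L);
      } finally {
        manager.closeReader();
      }
    }
  }

  /** Returns true if running the action throws a NullPointerException. */
  static boolean throwsNpe(Runnable action) {
    try {
      action.run();
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    BuggySource buggy = new BuggySource();
    buggy.readOnce(); // first pass opens, seeks and closes without trouble
    System.out.println("buggy second pass NPE: " + throwsNpe(buggy::readOnce));

    FixedSource fixed = new FixedSource();
    fixed.readOnce();
    System.out.println("fixed second pass NPE: " + throwsNpe(fixed::readOnce));
  }
}
```

In the buggy variant the second pass skips the open because its own field is 
still non-null, then fails inside the manager. Resetting the field in the 
finally block would also work, but removing the field avoids the need to 
remember that.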

I attached a patch to handle the NPE case and will spend more time on the table 
loading issue.
                
> TestReplicationQueueFailover* runs for a minute, spews 3/4million lines 
> complaining 'Filesystem closed', has an NPE, and still passes?
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7982
>                 URL: https://issues.apache.org/jira/browse/HBASE-7982
>             Project: HBase
>          Issue Type: Bug
>          Components: build
>            Reporter: stack
>            Priority: Blocker
>
> I was trying to look at why the odd time Hudson OOMEs trying to make a report 
> on 0.95 build #4 https://builds.apache.org/job/hbase-0.95/4/console:
> {code}
> ERROR: Failed to archive test reports
> hudson.util.IOException2: remote file operation failed: 
> /home/jenkins/jenkins-slave/workspace/hbase-0.95 at 
> hudson.remoting.Channel@151a4e3e:ubuntu3
>       at hudson.FilePath.act(FilePath.java:861)
>       at hudson.FilePath.act(FilePath.java:838)
>       at hudson.tasks.junit.JUnitParser.parse(JUnitParser.java:87)
>       at 
> ...
> Caused by: java.lang.OutOfMemoryError: Java heap space
>       at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
>       at java.nio.CharBuffer.allocate(CharBuffer.java:329)
>       at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
>       at java.nio.charset.Charset.decode(Charset.java:791)
>       at hudson.tasks.junit.SuiteResult.<init>(SuiteResult.java:215)
> ...
> {code}
> We are trying to allocate a big buffer and failing.
> Looking at reports being generated, we have quite a few that are > 10MB in 
> size:
> {code}
> durruti:0.95 stack$ find hbase-* -type f -size +10000k -exec ls -la {} \;
> -rw-r--r--@ 1 stack  staff  11126492 Feb 27 06:14 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.backup.TestHFileArchiving-output.txt
> -rw-r--r--@ 1 stack  staff  13296009 Feb 27 05:47 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestFromClientSide3-output.txt
> -rw-r--r--@ 1 stack  staff  10541898 Feb 27 05:47 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestMultiParallel-output.txt
> -rw-r--r--@ 1 stack  staff  25344601 Feb 27 05:51 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient-output.txt
> -rw-r--r--@ 1 stack  staff  17966969 Feb 27 06:12 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction-output.txt
> -rw-r--r--@ 1 stack  staff  17699068 Feb 27 06:09 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit-output.txt
> -rw-r--r--@ 1 stack  staff  17701832 Feb 27 06:07 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed-output.txt
> -rw-r--r--@ 1 stack  staff  717853709 Feb 27 06:17 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailover-output.txt
> -rw-r--r--@ 1 stack  staff  563616793 Feb 27 06:17 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed-output.txt
> {code}
> ... with TestReplicationQueueFailover* being an order of magnitude bigger 
> than the others.
> Looking in the test I see both spewing between 800 and 900 thousand lines in 
> about a minute.  Here is their fixation:
> {code}
> 8908998 2013-02-27 06:17:48,176 ERROR 
> [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] 
> wal.FSHLog$LogSyncer(1012): Error while syncing, requesting close of hlog.
> 8908999 java.io.IOException: Filesystem closed
> 8909000 ,...at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909001 ,...at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909002 ,...at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909003 ,...at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909004 ,...at 
> org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909005 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909006 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909007 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909008 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909009 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909010 ,...at java.lang.Thread.run(Thread.java:722)
> 8909011 2013-02-27 06:17:48,176 FATAL 
> [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] 
> wal.FSHLog(1140): Could not sync. Requesting close of hlog
> 8909012 java.io.IOException: Filesystem closed
> 8909013 ,...at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909014 ,...at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909015 ,...at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909016 ,...at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909017 ,...at 
> org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909018 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909019 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909020 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909021 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909022 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909023 ,...at java.lang.Thread.run(Thread.java:722)
> ...
> {code}
> These tests are 'succeeding'?
> I also see in both:
> {code}
>    3891 java.lang.NullPointerException
>    3892 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.seek(SequenceFileLogReader.java:261)
>    3893 ,...at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.seek(ReplicationHLogReaderManager.java:103)
>    3894 ,...at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:415)
>    3895 ,...at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:333)
> {code}
