[jira] [Commented] (HBASE-14262) Big Trunk unit tests failing with OutOfMemoryError: unable to create new native thread
[ https://issues.apache.org/jira/browse/HBASE-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705530#comment-14705530 ]

stack commented on HBASE-14262:
---
bq. So I suggest we verify the jstack result first to see whether we really create too many threads. Maybe we are reaching the precipice, so sometimes it fails and sometimes not...

Looking. In TestDistributedLogReplay (I am looking at this test first because INFRA-10150 flags it as a zombie), we start 6 servers, so the unit test is running with 450+ threads. Let me dig in more (there seems to be a shutdown issue here too).

Big Trunk unit tests failing with OutOfMemoryError: unable to create new native thread
Key: HBASE-14262
URL: https://issues.apache.org/jira/browse/HBASE-14262
Project: HBase
Issue Type: Bug
Components: test
Reporter: stack
Assignee: stack

The big unit tests are failing with OOME: can't create native threads. I was also getting the OOME locally on my MBP. git bisect got me to HBASE-13065, where we upped the test heap for TestDistributedLogSplitting back in February. Around the time that went in, we had similar OOME issues, but back then it was because we were running 32-bit JVMs. That does not seem to be the case here. A recent run failed all of the tests below, most of them with OOME:
{code}
{color:red}-1 core tests{color}.
The patch failed these unit tests:
    org.apache.hadoop.hbase.replication.TestReplicationEndpoint
    org.apache.hadoop.hbase.replication.TestPerTableCFReplication
    org.apache.hadoop.hbase.wal.TestBoundedRegionGroupingProvider
    org.apache.hadoop.hbase.replication.TestReplicationKillMasterRSCompressed
    org.apache.hadoop.hbase.replication.regionserver.TestRegionReplicaReplicationEndpointNoMaster
    org.apache.hadoop.hbase.replication.TestReplicationKillSlaveRS
    org.apache.hadoop.hbase.replication.regionserver.TestRegionReplicaReplicationEndpoint
    org.apache.hadoop.hbase.replication.TestMasterReplication
    org.apache.hadoop.hbase.mapred.TestTableMapReduce
    org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster
    org.apache.hadoop.hbase.regionserver.TestRegionFavoredNodes
    org.apache.hadoop.hbase.replication.TestMultiSlaveReplication
    org.apache.hadoop.hbase.zookeeper.TestZKLeaderManager
    org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable
    org.apache.hadoop.hbase.TestGlobalMemStoreSize
    org.apache.hadoop.hbase.wal.TestWALFiltering
    org.apache.hadoop.hbase.replication.TestReplicationSmallTests
    org.apache.hadoop.hbase.replication.TestReplicationSyncUpTool
    org.apache.hadoop.hbase.replication.TestReplicationWithTags
    org.apache.hadoop.hbase.master.procedure.TestTruncateTableProcedure
    org.apache.hadoop.hbase.replication.TestReplicationChangingPeerRegionservers
    org.apache.hadoop.hbase.wal.TestDefaultWALProviderWithHLogKey
    org.apache.hadoop.hbase.regionserver.TestPerColumnFamilyFlush
    org.apache.hadoop.hbase.snapshot.TestMobRestoreFlushSnapshotFromClient
    org.apache.hadoop.hbase.master.procedure.TestCreateTableProcedure
    org.apache.hadoop.hbase.wal.TestWALFactory
    org.apache.hadoop.hbase.master.procedure.TestServerCrashProcedure
    org.apache.hadoop.hbase.replication.TestReplicationDisableInactivePeer
    org.apache.hadoop.hbase.master.procedure.TestAddColumnFamilyProcedure
    org.apache.hadoop.hbase.mapred.TestMultiTableSnapshotInputFormat
    org.apache.hadoop.hbase.master.procedure.TestEnableTableProcedure
    org.apache.hadoop.hbase.master.TestMasterFailoverBalancerPersistence
    org.apache.hadoop.hbase.TestStochasticBalancerJmxMetrics

{color:red}-1 core zombie tests{color}. There are 16 zombie test(s):
    at org.apache.hadoop.hbase.security.visibility.TestVisibilityLabels.testVisibilityLabelsWithComplexLabels(TestVisibilityLabels.java:216)
    at org.apache.hadoop.hbase.mapred.TestTableInputFormat.testTableRecordReaderScannerFail(TestTableInputFormat.java:281)
    at org.apache.hadoop.hbase.replication.TestMultiSlaveReplication.testMultiSlaveReplication(TestMultiSlaveReplication.java:129)
    at org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileScanning(TestHRegion.java:3799)
{code}
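One way to answer the jstack question above from inside a test is to log the JVM's live and peak thread counts, grouped by thread name. The sketch below uses only standard JDK calls (`ManagementFactory.getThreadMXBean()` with `getThreadCount()`, `getPeakThreadCount()`, `getDaemonThreadCount()`, and `Thread.getAllStackTraces()`). The class name `ThreadCountProbe` and the assumption that pool threads carry numbered names are illustrative only; this is not existing HBase code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical test helper (not part of HBase): a jstack-like tally of the JVM's live threads. */
public final class ThreadCountProbe {
  private ThreadCountProbe() {}

  /** Number of live threads right now, as reported by the platform ThreadMXBean. */
  public static int liveThreadCount() {
    return ManagementFactory.getThreadMXBean().getThreadCount();
  }

  /**
   * Counts live threads grouped by name with any trailing number stripped, so
   * "pool-3-thread-1" and "pool-3-thread-2" are both counted under "pool-3-thread".
   * Assumes pools name their threads with a numeric suffix.
   */
  public static Map<String, Integer> countByNamePrefix() {
    Map<String, Integer> counts = new TreeMap<>();
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      String prefix = t.getName().replaceAll("[-_#\\s]*\\d+$", "");
      counts.merge(prefix, 1, Integer::sum);
    }
    return counts;
  }

  /** One-line summary suitable for logging from a test's setup or teardown. */
  public static String summary() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    return "live=" + mx.getThreadCount()
        + " peak=" + mx.getPeakThreadCount()
        + " daemon=" + mx.getDaemonThreadCount()
        + " byName=" + countByNamePrefix();
  }
}
```

Logging `summary()` at the start and end of a test, for example around the 6-server cluster in TestDistributedLogReplay, should show which thread groups make up the 450+ threads and whether they are still alive after shutdown.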
[ https://issues.apache.org/jira/browse/HBASE-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703993#comment-14703993 ]

stack commented on HBASE-14262:
---
The above happened when we ran on:
Building remotely on H9 (Mapreduce Falcon Hadoop Pig Zookeeper Tez Hdfs) in workspace /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build
...
/home/jenkins/jenkins-slave/tools/hudson.model.JDK/jdk-1.7u51/bin/java
java version 1.7.0_51
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
/home/jenkins/jenkins-slave/tools/hudson.model.JDK/jdk-1.7u51
java version 1.7.0_51
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
...
Here is the link: https://builds.apache.org/job/PreCommit-HBASE-Build/15169//consoleFull
Also here, on H1: https://builds.apache.org/view/H-L/view/HBase/job/PreCommit-HBASE-Build/15167/consoleFull
... and on H6: https://builds.apache.org/view/H-L/view/HBase/job/PreCommit-HBASE-Build/15164/consoleFull
Something has changed.
[ https://issues.apache.org/jira/browse/HBASE-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704067#comment-14704067 ]

stack commented on HBASE-14262:
---
H9 ran all tests without OOME here (Aug 17, 2015 7:14:53 AM): https://builds.apache.org/view/H-L/view/HBase/job/PreCommit-HBASE-Build/15125/console
Going back a ways, here is where the OOMEs start: Failed Console Output #15134 (Aug 18, 2015 5:10:02 AM).
It seems to be precommit-only. I'm not seeing it in trunk builds, but I haven't seen a machine overlap between precommit and trunk just yet (and the trunk build doesn't run often enough).
If I revert HBASE-13065, TestAcidGuarantees passes on my local mac...
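HBASE-13065 raised the test heap, and reverting it lets TestAcidGuarantees pass locally. A simplified way to reason about that (an assumption, not a confirmed diagnosis): if the heap and native thread stacks compete for one fixed memory budget, every megabyte added to -Xmx is a megabyte less for stacks. On a 64-bit JVM the binding limit is often elsewhere, for example the per-user process limit (`ulimit -u`), so treat this only as a back-of-envelope check. `ThreadHeadroom` and all numbers below are hypothetical, not measurements from the Jenkins hosts.

```java
/** Hypothetical back-of-envelope model: how many thread stacks fit in the memory left after the heap. */
public final class ThreadHeadroom {
  private ThreadHeadroom() {}

  /**
   * @param budgetMb      total memory assumed available to the JVM process (hypothetical)
   * @param heapMb        heap size (-Xmx)
   * @param otherNativeMb permgen, code cache, GC structures, etc. (rough guess)
   * @param stackKb       per-thread stack size (-Xss)
   * @return estimated maximum number of threads, never negative
   */
  public static long maxThreads(long budgetMb, long heapMb, long otherNativeMb, long stackKb) {
    long leftoverKb = (budgetMb - heapMb - otherNativeMb) * 1024L;
    return Math.max(0L, leftoverKb / stackKb);
  }

  public static void main(String[] args) {
    // Made-up figures: 4 GB budget, 512 MB other native memory, 1 MB stacks.
    System.out.println(maxThreads(4096, 2048, 512, 1024)); // prints 1536
    System.out.println(maxThreads(4096, 3072, 512, 1024)); // prints 512
  }
}
```

With these made-up figures, adding 1 GB of heap cuts the headroom from 1536 to 512 threads; only the shape of the trade-off carries over, not the values.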
[ https://issues.apache.org/jira/browse/HBASE-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704287#comment-14704287 ]

Heng Chen commented on HBASE-14262:
---
Agree with [~Apache9]. We can use a BlockingQueue in {{TestContext}} instead of {{Set<TestThread> testThreads}}, and then use a thread pool with fewer threads that picks tasks from the BlockingQueue to run.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
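A rough sketch of this suggestion follows. It assumes each logical test thread's work can be split into short steps, because a task that loops forever would starve a pool smaller than the number of tasks. Logical tasks wait in a `LinkedBlockingQueue`. A fixed number of real threads each take a task, run one step, and put it back until it reports it is done. `QueueBackedTestContext` and `StepTask` are hypothetical names, not HBase's actual `TestContext` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: many logical test "threads" multiplexed onto a small fixed pool of real threads. */
public class QueueBackedTestContext {

  /** One logical worker. step() does one short unit of work and returns true to be re-queued. */
  public interface StepTask {
    boolean step() throws Exception;
  }

  private final BlockingQueue<StepTask> queue = new LinkedBlockingQueue<>();
  private final AtomicInteger unfinished = new AtomicInteger();
  private final AtomicReference<Throwable> firstError = new AtomicReference<>();
  private final int poolSize;

  public QueueBackedTestContext(int poolSize) {
    this.poolSize = poolSize;
  }

  public void addTask(StepTask task) {
    unfinished.incrementAndGet();
    queue.add(task);
  }

  /** Starts poolSize real threads, waits until every logical task is done, rethrows the first failure. */
  public void runToCompletion() {
    List<Thread> workers = new ArrayList<>();
    for (int i = 0; i < poolSize; i++) {
      Thread t = new Thread(this::workerLoop, "test-worker-" + i);
      workers.add(t);
      t.start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("interrupted while waiting for workers", e);
      }
    }
    Throwable err = firstError.get();
    if (err != null) {
      throw new RuntimeException("a test task failed", err);
    }
  }

  private void workerLoop() {
    // Exit once no logical task is left, whether queued or held by another worker, or on first failure.
    while (unfinished.get() > 0 && firstError.get() == null) {
      StepTask task;
      try {
        task = queue.poll(50, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        return;
      }
      if (task == null) {
        continue; // other workers hold the remaining tasks; check again
      }
      try {
        if (task.step()) {
          queue.add(task); // round-robin: give the other logical threads a turn
        } else {
          unfinished.decrementAndGet();
        }
      } catch (Throwable t) {
        firstError.compareAndSet(null, t);
        unfinished.decrementAndGet();
      }
    }
  }
}
```

Re-queueing after every step is what lets, say, 50 logical readers and writers share 4 real threads. The trade-off is less true concurrency, which could weaken a test like TestAcidGuarantees if it depends on readers racing writers.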
[ https://issues.apache.org/jira/browse/HBASE-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704233#comment-14704233 ]

Duo Zhang commented on HBASE-14262:
---
{quote}
If I revert HBASE-13065, TestAcidGuarantees passes on my local mac...
{quote}
This could happen, since reducing the heap size also increases the native memory we can use, so we can create more threads...
I had to deal with TestAcidGuarantees before, when the new AsyncRpcClient was introduced. I remember that in the end I created a static EventLoopGroup and made all AsyncRpcClients share it to reduce the number of threads...
So I suggest we verify the jstack result first to see whether we really create too many threads. Maybe we are reaching the precipice, so sometimes it fails and sometimes not...
Thanks.
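The shared-EventLoopGroup fix described above comes down to one process-wide pool that every client instance reuses, instead of a pool per client. The sketch below uses a plain `java.util.concurrent` fixed thread pool as a stand-in for Netty's EventLoopGroup. `SharedPoolClient` and its pool size of 4 are hypothetical and are not the real AsyncRpcClient code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical client whose work runs on one static pool shared by all instances. */
public class SharedPoolClient {
  private static final int SHARED_THREADS = 4; // hypothetical size
  private static final AtomicInteger THREAD_ID = new AtomicInteger();

  // Created once per JVM (class initialization is thread-safe), so N clients cost
  // SHARED_THREADS threads in total instead of a separate pool per client.
  private static final ExecutorService SHARED_POOL =
      Executors.newFixedThreadPool(SHARED_THREADS, r -> {
        Thread t = new Thread(r, "shared-client-pool-" + THREAD_ID.incrementAndGet());
        t.setDaemon(true); // do not keep the JVM (or a forked test JVM) alive at shutdown
        return t;
      });

  private final String name;

  public SharedPoolClient(String name) {
    this.name = name;
  }

  /** Runs a pretend call on the shared pool and returns "clientName@threadName". */
  public Future<String> call() {
    return SHARED_POOL.submit(() -> name + "@" + Thread.currentThread().getName());
  }
}
```

Creating 20 such clients still uses at most 4 pool threads. The cost is that a slow call on one client can delay calls from the others, because they share the same threads.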