[jira] [Updated] (HDFS-11272) Missing whitespace in the message of TestFSDirAttrOp.java
[ https://issues.apache.org/jira/browse/HDFS-11272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11272:
-------------------------------
    Assignee: Jimmy Xiang
    Affects Version/s: 2.8.0
    Status: Patch Available  (was: Open)

> Missing whitespace in the message of TestFSDirAttrOp.java
> ----------------------------------------------------------
>
> Key: HDFS-11272
> URL: https://issues.apache.org/jira/browse/HDFS-11272
> Project: Hadoop HDFS
> Issue Type: Test
> Components: test
> Affects Versions: 2.8.0
> Reporter: Akira Ajisaka
> Assignee: Jimmy Xiang
> Priority: Trivial
> Labels: newbie
> Attachments: hdfs-11272.patch
>
> {code:title=TestFSDirAttrOp.java}
> assertFalse("SetTimes should not update access time"
>     + "because it's within the last precision interval",
> {code}
> A space is missing between "time" and "because".
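For context, the whole fix is a single missing space in the assertion message. A minimal sketch of the corrected literal follows; the second argument to assertFalse is elided in the quoted snippet above and stays elided here:

{code:title=TestFSDirAttrOp.java (corrected, sketch)}
// Trailing space added so the two literals concatenate to
// "...access time because..." rather than "...access timebecause...".
assertFalse("SetTimes should not update access time "
    + "because it's within the last precision interval",
{code}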
[jira] [Updated] (HDFS-11272) Missing whitespace in the message of TestFSDirAttrOp.java
[ https://issues.apache.org/jira/browse/HDFS-11272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11272:
-------------------------------
    Attachment: hdfs-11272.patch

> Missing whitespace in the message of TestFSDirAttrOp.java
> ----------------------------------------------------------
>
> Key: HDFS-11272
> URL: https://issues.apache.org/jira/browse/HDFS-11272
> Project: Hadoop HDFS
> Issue Type: Test
> Components: test
> Affects Versions: 2.8.0
> Reporter: Akira Ajisaka
> Priority: Trivial
> Labels: newbie
> Attachments: hdfs-11272.patch
>
> {code:title=TestFSDirAttrOp.java}
> assertFalse("SetTimes should not update access time"
>     + "because it's within the last precision interval",
> {code}
> A space is missing between "time" and "because".
[jira] [Commented] (HDFS-11258) File mtime change could not save to editlog
[ https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770512#comment-15770512 ]

Jimmy Xiang commented on HDFS-11258:
------------------------------------

The addendum is good for both branch-2 and branch-2.8.

> File mtime change could not save to editlog
> --------------------------------------------
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Critical
> Fix For: 2.8.0, 3.0.0-alpha1
> Attachments: hdfs-11258-addendum-branch2.patch, hdfs-11258.1.patch, hdfs-11258.2.patch, hdfs-11258.3.patch, hdfs-11258.4.patch
>
> When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog
[ https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11258:
-------------------------------
    Attachment: hdfs-11258-addendum-branch2.patch

Attached an addendum for branch-2 to fix compilation.

> File mtime change could not save to editlog
> --------------------------------------------
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Critical
> Fix For: 2.8.0, 3.0.0-alpha1
> Attachments: hdfs-11258-addendum-branch2.patch, hdfs-11258.1.patch, hdfs-11258.2.patch, hdfs-11258.3.patch, hdfs-11258.4.patch
>
> When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog
[ https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11258:
-------------------------------
    Attachment: hdfs-11258.4.patch

> File mtime change could not save to editlog
> --------------------------------------------
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Critical
> Attachments: hdfs-11258.1.patch, hdfs-11258.2.patch, hdfs-11258.3.patch, hdfs-11258.4.patch
>
> When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog
[ https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11258:
-------------------------------
    Attachment: hdfs-11258.3.patch

Thanks [~wheat9] for the review. Fixed the checkstyle issues.

> File mtime change could not save to editlog
> --------------------------------------------
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Critical
> Attachments: hdfs-11258.1.patch, hdfs-11258.2.patch, hdfs-11258.3.patch
>
> When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog
[ https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11258:
-------------------------------
    Attachment: hdfs-11258.2.patch

Patch v2, added a unit test.

> File mtime change could not save to editlog
> --------------------------------------------
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-11258.1.patch, hdfs-11258.2.patch
>
> When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog
[ https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11258:
-------------------------------
    Status: Patch Available  (was: Open)

> File mtime change could not save to editlog
> --------------------------------------------
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-11258.1.patch
>
> When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog
[ https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-11258:
-------------------------------
    Attachment: hdfs-11258.1.patch

> File mtime change could not save to editlog
> --------------------------------------------
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-11258.1.patch
>
> When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
[jira] [Created] (HDFS-11258) File mtime change could not save to editlog
Jimmy Xiang created HDFS-11258:
----------------------------------

    Summary: File mtime change could not save to editlog
    Key: HDFS-11258
    URL: https://issues.apache.org/jira/browse/HDFS-11258
    Project: Hadoop HDFS
    Issue Type: Bug
    Reporter: Jimmy Xiang
    Assignee: Jimmy Xiang
    Priority: Minor

When both mtime and atime are changed, and atime is not beyond the precision limit, the mtime change is not saved to edit logs.
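For readers following along: the failure mode described above is consistent with a status flag that gets clobbered. A simplified sketch of that shape follows; this is an assumed reconstruction of the setTimes path, not the attached patches:

{code:title=Simplified sketch of the bug (assumed shape, not the actual patch)}
boolean status = false;
if (mtime != -1) {
  inode = inode.setModificationTime(mtime, latestSnapshotId);
  status = true;                   // mtime changed; must reach the edit log
}
if (atime != -1) {
  long inodeTime = inode.getAccessTime();
  if (atime <= inodeTime + precision && !force) {
    // Bug: resetting status here clobbers the earlier status = true, so the
    // caller skips logging a setTimes op even though mtime was updated above.
    // The fix is to leave status untouched in this branch.
    status = false;
  } else {
    inode.setAccessTime(atime, latestSnapshotId);
    status = true;
  }
}
return status;                     // caller writes the edit-log op only when true
{code}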
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Assignee: (was: Jimmy Xiang)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
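As a rough illustration of the idea in this issue (all names here are hypothetical; this is not the attached patch's API), the datanode would take the volume out of the block-allocation path and release its open handles so the operator can unmount the disk:

{code:title=Hypothetical sketch of the mark-down idea (not the attached patch)}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustration only: a registry that lets an operator take one volume out
// of service without restarting the whole datanode (option #2 above,
// minus the restart).
class VolumeRegistry {
  static class Volume {
    final AtomicBoolean markedDown = new AtomicBoolean(false);
    void closeOpenHandles() { /* release file descriptors on this disk */ }
  }

  private final Map<String, Volume> volumes = new ConcurrentHashMap<>();

  /** New block allocations skip volumes that are marked down. */
  boolean isUsable(String mountPoint) {
    Volume v = volumes.get(mountPoint);
    return v != null && !v.markedDown.get();
  }

  /** Operator-triggered: stop using the sick disk so it can be unmounted. */
  void markDown(String mountPoint) {
    Volume v = volumes.get(mountPoint);
    if (v != null && v.markedDown.compareAndSet(false, true)) {
      v.closeOpenHandles();
    }
  }
}
{code}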
[jira] [Updated] (HDFS-4284) BlockReaderLocal not notified of failed disks
[ https://issues.apache.org/jira/browse/HDFS-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4284:
------------------------------
    Assignee: (was: Jimmy Xiang)

> BlockReaderLocal not notified of failed disks
> ----------------------------------------------
>
> Key: HDFS-4284
> URL: https://issues.apache.org/jira/browse/HDFS-4284
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 3.0.0, 2.0.2-alpha
> Reporter: Andy Isaacson
>
> When a DN marks a disk as bad, it stops using replicas on that disk. However, a long-running {{BlockReaderLocal}} instance will continue to access replicas on the failing disk. Somehow we should let the in-client BlockReaderLocal know that a disk has been marked as bad so that it can stop reading from the bad disk.
>
> From HDFS-4239:
> bq. To rephrase that, a long-running BlockReaderLocal will ride over local DN restarts and disk ejections. We had to drain the RS of all its regions in order to stop it from using the bad disk.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Open  (was: Patch Available)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123715#comment-14123715 ]

Jimmy Xiang commented on HDFS-4239:
-----------------------------------

Sure. Assigned it to you.

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Yongjun Zhang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Assignee: Yongjun Zhang

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Yongjun Zhang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897069#comment-13897069 ]

Jimmy Xiang commented on HDFS-4239:
-----------------------------------

Ping. Can anyone take a look at patch v4? Thanks.

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Commented] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892424#comment-13892424 ]

Jimmy Xiang commented on HDFS-5882:
-----------------------------------

Ok, let me take a look at that.

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-5882:
------------------------------
    Status: Open  (was: Patch Available)

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-5882:
------------------------------
    Status: Patch Available  (was: Open)

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch, hdfs-5882_v2.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-5882:
------------------------------
    Attachment: hdfs-5882_v2.patch

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch, hdfs-5882_v2.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Commented] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892544#comment-13892544 ]

Jimmy Xiang commented on HDFS-5882:
-----------------------------------

Posted v2, which force-flushes the logs by closing the appenders. The flakiness is mostly due to the async appender; closing it will flush the logs.

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch, hdfs-5882_v2.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
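The flush-by-close trick, roughly, assuming the log4j 1.x API that HDFS used at the time (this is a sketch of the approach described in the comment, not the attached patch; the logger name follows FSNamesystem's audit-logger convention):

{code:title=Sketch: flushing the audit log by closing its appenders}
import java.util.Enumeration;
import org.apache.log4j.Appender;
import org.apache.log4j.Logger;

// Closing every appender forces buffered events to disk -- in particular
// the AsyncAppender, whose dispatcher thread may still be holding events
// when the test tries to read the audit log file.
static void flushAuditLog() {
  Logger auditLog =
      Logger.getLogger("org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit");
  Enumeration<?> appenders = auditLog.getAllAppenders();
  while (appenders.hasMoreElements()) {
    ((Appender) appenders.nextElement()).close();  // close() flushes first
  }
}
{code}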
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Open  (was: Patch Available)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Created] (HDFS-5882) TestAuditLogs is flaky
Jimmy Xiang created HDFS-5882:
---------------------------------

    Summary: TestAuditLogs is flaky
    Key: HDFS-5882
    URL: https://issues.apache.org/jira/browse/HDFS-5882
    Project: Hadoop HDFS
    Issue Type: Test
    Reporter: Jimmy Xiang
    Priority: Minor

TestAuditLogs fails sometimes:
{noformat}
Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
java.lang.AssertionError: null
    at org.junit.Assert.fail(Assert.java:92)
    at org.junit.Assert.assertTrue(Assert.java:43)
    at org.junit.Assert.assertNotNull(Assert.java:526)
    at org.junit.Assert.assertNotNull(Assert.java:537)
    at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
    at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
    at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
{noformat}
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Attachment: hdfs-4239_v5.patch

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Patch Available  (was: Open)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Created] (HDFS-5883) TestZKPermissionsWatcher.testPermissionsWatcher fails sometimes
Jimmy Xiang created HDFS-5883:
---------------------------------

    Summary: TestZKPermissionsWatcher.testPermissionsWatcher fails sometimes
    Key: HDFS-5883
    URL: https://issues.apache.org/jira/browse/HDFS-5883
    Project: Hadoop HDFS
    Issue Type: Test
    Reporter: Jimmy Xiang
    Assignee: Jimmy Xiang
    Priority: Trivial

It looks like sleeping 100 ms is not enough for the permission change to propagate to the other watchers. Will increase the sleep time a little.
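A longer sleep shrinks the window but never closes it; a deadline-bounded poll is the usual sturdier alternative. A sketch, where checkPermissions() is a hypothetical stand-in for whatever the test asserts:

{code:title=Sketch: polling with a deadline instead of a fixed sleep}
// Inside a JUnit test method (throws InterruptedException): poll until the
// watcher has observed the permission change, failing only after a generous
// deadline. checkPermissions() is the test's own assertion predicate.
long deadline = System.currentTimeMillis() + 30000L;
while (!checkPermissions()) {
  if (System.currentTimeMillis() > deadline) {
    fail("permission change did not propagate within 30s");
  }
  Thread.sleep(100);
}
{code}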
[jira] [Assigned] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang reassigned HDFS-5882:
---------------------------------
    Assignee: Jimmy Xiang

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-5882:
------------------------------
    Attachment: hdfs-5882.patch

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-5882:
------------------------------
    Status: Patch Available  (was: Open)

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Commented] (HDFS-5882) TestAuditLogs is flaky
[ https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891531#comment-13891531 ]

Jimmy Xiang commented on HDFS-5882:
-----------------------------------

I was thinking of force-flushing the logger to disk too, but there isn't an easy way. With the current patch, I don't see the problem anymore locally.

> TestAuditLogs is flaky
> -----------------------
>
> Key: HDFS-5882
> URL: https://issues.apache.org/jira/browse/HDFS-5882
> Project: Hadoop HDFS
> Issue Type: Test
> Reporter: Jimmy Xiang
> Assignee: Jimmy Xiang
> Priority: Minor
> Attachments: hdfs-5882.patch
>
> TestAuditLogs fails sometimes:
> {noformat}
> Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
> testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  Time elapsed: 2.085 sec  <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:92)
>     at org.junit.Assert.assertTrue(Assert.java:43)
>     at org.junit.Assert.assertNotNull(Assert.java:526)
>     at org.junit.Assert.assertNotNull(Assert.java:537)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
>     at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
> {noformat}
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Open  (was: Patch Available)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889841#comment-13889841 ]

Jimmy Xiang commented on HDFS-4239:
-----------------------------------

Good point. Let me handle the access control in the next patch. As for blacklisted volume IDs, can we handle that in a separate issue?

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Patch Available  (was: Open)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Attachment: hdfs-4239_v4.patch

Attached v4, which adds access control when security is enabled.

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, hdfs-4239_v4.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
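For readers unfamiliar with the pattern, such a guard in Hadoop daemons is usually shaped roughly as below. This is a sketch, not the attached patch's code; isPermissionEnabled and superUser are assumed field names:

{code:title=Sketch: superuser check for the admin RPC (assumed shape)}
import java.io.IOException;
import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.UserGroupInformation;

// When security is enabled, only the configured superuser may tell the
// datanode to mark a volume down. isPermissionEnabled and superUser are
// assumed fields, not the attached patch's actual names.
private void checkSuperuserPrivilege() throws IOException {
  if (!isPermissionEnabled) {
    return;  // security off: allow the call
  }
  UserGroupInformation caller = UserGroupInformation.getCurrentUser();
  if (!caller.getShortUserName().equals(superUser)) {
    throw new AccessControlException("Superuser privilege is required to"
        + " mark a volume down; user=" + caller.getShortUserName());
  }
}
{code}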
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887291#comment-13887291 ]

Jimmy Xiang commented on HDFS-4239:
-----------------------------------

This test failure is not related.

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Open  (was: Patch Available)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Attachment: hdfs-4239_v3.patch

Attached v3, which fixes the test failures.

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Patch Available  (was: Open)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884897#comment-13884897 ]

Jimmy Xiang commented on HDFS-4239:
-----------------------------------

Cool, I agree. Attached v2, which releases all references to the volume marked down. In my test, I don't see any open file descriptors pointing to the volume that was marked down.

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Attachment: hdfs-4239_v2.patch

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HDFS-4239:
------------------------------
    Status: Patch Available  (was: Open)

> Means of telling the datanode to stop using a sick disk
> --------------------------------------------------------
>
> Key: HDFS-4239
> URL: https://issues.apache.org/jira/browse/HDFS-4239
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jimmy Xiang
> Attachments: hdfs-4239.patch, hdfs-4239_v2.patch
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are:
>
> 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low-latency serving: e.g. hosting an hbase cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you can't unmount the disk while it is in use). The latter is better in that only the bad disk's data is rereplicated, not all of the datanode's data.
>
> Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced.
[jira] [Assigned] (HDFS-4284) BlockReaderLocal not notified of failed disks
[ https://issues.apache.org/jira/browse/HDFS-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang reassigned HDFS-4284: - Assignee: Jimmy Xiang BlockReaderLocal not notified of failed disks - Key: HDFS-4284 URL: https://issues.apache.org/jira/browse/HDFS-4284 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 3.0.0, 2.0.2-alpha Reporter: Andy Isaacson Assignee: Jimmy Xiang When a DN marks a disk as bad, it stops using replicas on that disk. However a long-running {{BlockReaderLocal}} instance will continue to access replicas on the failing disk. Somehow we should let the in-client BlockReaderLocal know that a disk has been marked as bad so that it can stop reading from the bad disk. From HDFS-4239: bq. To rephrase that, a long running BlockReaderLocal will ride over local DN restarts and disk ejections. We had to drain the RS of all its regions in order to stop it from using the bad disk. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13883728#comment-13883728 ] Jimmy Xiang commented on HDFS-4239: --- We can release the lock after the volume is marked down. No new blocks will be allocated to this volume. What about the blocks on this volume that are still being written? The writing could take forever, for example, for a rarely updated HLog file. I was thinking of failing the write pipeline so that the client can set up another pipeline. Any problem with that? Means of telling the datanode to stop using a sick disk --- Key: HDFS-4239 URL: https://issues.apache.org/jira/browse/HDFS-4239 Project: Hadoop HDFS Issue Type: Improvement Reporter: stack Assignee: Jimmy Xiang Attachments: hdfs-4239.patch If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are: 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low latency serving: e.g. hosting an hbase cluster. 2. Stop the datanode, unmount the bad disk, and restart the datanode (You can't unmount the disk while it is in use). This latter is better in that only the bad disk's data is rereplicated, not all datanode data. Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
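A rough sketch of the pipeline-failure idea from this comment follows; the Volume interface and isMarkedDown are illustrative names, not HDFS APIs. Throwing an IOException out of the packet-receive path is enough, because the client's pipeline recovery reacts by rebuilding the pipeline on the remaining datanodes.
{code:java}
// Illustrative sketch, not actual DataNode code.
import java.io.IOException;

class BlockWriteSketch {
  void receivePacket(Volume volume, byte[] packet) throws IOException {
    if (volume.isMarkedDown()) {
      // Aborting this replica fails the pipeline; the writing client's
      // recovery logic then continues on a different datanode/volume.
      throw new IOException("Volume marked down, failing write pipeline");
    }
    volume.write(packet);
  }
}

interface Volume {
  boolean isMarkedDown();
  void write(byte[] data) throws IOException;
}
{code}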
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4239: -- Status: Open (was: Patch Available) Means of telling the datanode to stop using a sick disk --- Key: HDFS-4239 URL: https://issues.apache.org/jira/browse/HDFS-4239 Project: Hadoop HDFS Issue Type: Improvement Reporter: stack Assignee: Jimmy Xiang Attachments: hdfs-4239.patch If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are: 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low latency serving: e.g. hosting an hbase cluster. 2. Stop the datanode, unmount the bad disk, and restart the datanode (You can't unmount the disk while it is in use). This latter is better in that only the bad disk's data is rereplicated, not all datanode data. Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875538#comment-13875538 ] Jimmy Xiang commented on HDFS-4239: --- File 'in_use.lock' is still there after the volume is marked down. Let me take another look. Means of telling the datanode to stop using a sick disk --- Key: HDFS-4239 URL: https://issues.apache.org/jira/browse/HDFS-4239 Project: Hadoop HDFS Issue Type: Improvement Reporter: stack Assignee: Jimmy Xiang Attachments: hdfs-4239.patch If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are: 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low latency serving: e.g. hosting an hbase cluster. 2. Stop the datanode, unmount the bad disk, and restart the datanode (You can't unmount the disk while it is in use). This latter is better in that only the bad disk's data is rereplicated, not all datanode data. Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4239: -- Attachment: hdfs-4239.patch Attached a patch for trunk. It's good for branch 2 too. Means of telling the datanode to stop using a sick disk --- Key: HDFS-4239 URL: https://issues.apache.org/jira/browse/HDFS-4239 Project: Hadoop HDFS Issue Type: Improvement Reporter: stack Assignee: Jimmy Xiang Attachments: hdfs-4239.patch If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are: 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low latency serving: e.g. hosting an hbase cluster. 2. Stop the datanode, unmount the bad disk, and restart the datanode (You can't unmount the disk while it is in use). This latter is better in that only the bad disk's data is rereplicated, not all datanode data. Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4239: -- Status: Patch Available (was: Open) Means of telling the datanode to stop using a sick disk --- Key: HDFS-4239 URL: https://issues.apache.org/jira/browse/HDFS-4239 Project: Hadoop HDFS Issue Type: Improvement Reporter: stack Assignee: Jimmy Xiang Attachments: hdfs-4239.patch If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are: 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low latency serving: e.g. hosting an hbase cluster. 2. Stop the datanode, unmount the bad disk, and restart the datanode (You can't unmount the disk while it is in use). This latter is better in that only the bad disk's data is rereplicated, not all datanode data. Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang reassigned HDFS-4239: - Assignee: Jimmy Xiang Means of telling the datanode to stop using a sick disk --- Key: HDFS-4239 URL: https://issues.apache.org/jira/browse/HDFS-4239 Project: Hadoop HDFS Issue Type: Improvement Reporter: stack Assignee: Jimmy Xiang If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing occasionally, or just exhibiting high latency -- your choices are: 1. Decommission the total datanode. If the datanode is carrying 6 or 12 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- the rereplication of the downed datanode's data can be pretty disruptive, especially if the cluster is doing low latency serving: e.g. hosting an hbase cluster. 2. Stop the datanode, unmount the bad disk, and restart the datanode (You can't unmount the disk while it is in use). This latter is better in that only the bad disk's data is rereplicated, not all datanode data. Is it possible to do better, say, send the datanode a signal to tell it to stop using a disk an operator has designated 'bad'? This would be like option #2 above minus the need to stop and restart the datanode. Ideally the disk would become unmountable after a while. Nice to have would be being able to tell the datanode to restart using a disk after it's been replaced. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5220: -- Attachment: 2.4-5220.patch Thanks a lot, [~cmccabe]. I attached 2.4-5220.patch for 2.4. Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.4.0 Reporter: Rob Weltman Assignee: Jimmy Xiang Fix For: 3.0.0 Attachments: 2.4-5220.patch, hdfs-5220.patch, hdfs-5220_v2.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863451#comment-13863451 ] Jimmy Xiang commented on HDFS-5220: --- The test is a little flaky. It depends on the execution order. Let me fix it. Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.4.0 Reporter: Rob Weltman Assignee: Jimmy Xiang Fix For: 2.4.0 Attachments: 2.4-5220.patch, hdfs-5220.patch, hdfs-5220_v2.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5220: -- Attachment: hdfs-5220.addendum 2.4-5220.addendum If TestUserGroupInformation#testGetServerSideGroups() runs first, TestUserGroupInformation#testLogin will fail. Attached two addenda that change the group verification a little. Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.4.0 Reporter: Rob Weltman Assignee: Jimmy Xiang Fix For: 2.4.0 Attachments: 2.4-5220.addendum, 2.4-5220.patch, hdfs-5220.addendum, hdfs-5220.patch, hdfs-5220_v2.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
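A sketch of the kind of order-insensitive check such an addendum implies; the test and helper names below are made up for illustration, not the addendum's actual code. Comparing the resolved groups as sets removes any dependence on which test ran first or the order in which groups come back.
{code:java}
// Illustrative JUnit sketch; resolveGroups() stands in for UGI.getGroupNames().
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.HashSet;

import org.junit.Test;

public class GroupCheckSketch {
  @Test
  public void testGroupsIgnoringOrder() {
    String[] expected = {"staff", "hadoop"};
    String[] actual = resolveGroups();
    // Set equality: membership matters, ordering does not.
    assertEquals(new HashSet<>(Arrays.asList(expected)),
        new HashSet<>(Arrays.asList(actual)));
  }

  private String[] resolveGroups() {
    return new String[] {"hadoop", "staff"};
  }
}
{code}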
[jira] [Commented] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863500#comment-13863500 ] Jimmy Xiang commented on HDFS-5220: --- Let's use a separate jira instead: HADOOP-10207. Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.4.0 Reporter: Rob Weltman Assignee: Jimmy Xiang Fix For: 2.4.0 Attachments: 2.4-5220.addendum, 2.4-5220.patch, hdfs-5220.addendum, hdfs-5220.patch, hdfs-5220_v2.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5220: -- Attachment: hdfs-5220_v2.patch Attached v2, which initializes and populates the quantile metrics in UserGroupInformation.setConfiguration, so that we don't change the UserGroupInformation interface. Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Attachments: hdfs-5220.patch, hdfs-5220_v2.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
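A rough sketch of that v2 shape, assuming the quantiles come from Hadoop's metrics2 MutableQuantiles; the config key, class, and field names below are assumptions for illustration, not the patch itself. Everything is wired up from a setConfiguration-style hook, so no new public method lands on UserGroupInformation.
{code:java}
// Sketch under stated assumptions; not the committed hdfs-5220_v2.patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableQuantiles;

class UgiMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry("UgiMetrics");
  private MutableQuantiles[] getGroupsQuantiles;

  /** Invoked from a setConfiguration-style hook. Quantiles are created only
   *  when percentile intervals are configured. */
  void init(Configuration conf) {
    int[] intervals =
        conf.getInts("hadoop.user.group.metrics.percentiles.intervals");
    getGroupsQuantiles = new MutableQuantiles[intervals.length];
    for (int i = 0; i < intervals.length; i++) {
      getGroupsQuantiles[i] = registry.newQuantiles(
          "getGroups" + intervals[i] + "s",
          "Group resolution time", "ops", "latency", intervals[i]);
    }
  }

  void addGetGroups(long latencyMs) {
    if (getGroupsQuantiles != null) {
      for (MutableQuantiles q : getGroupsQuantiles) {
        q.add(latencyMs);
      }
    }
  }
}
{code}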
[jira] [Updated] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5220: -- Status: Patch Available (was: Open) Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Attachments: hdfs-5220.patch, hdfs-5220_v2.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Assigned] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang reassigned HDFS-5220: - Assignee: Jimmy Xiang Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5220: -- Attachment: hdfs-5220.patch Attached the first version of the patch. One drawback is that the quantile info is in a different metrics record from the rate info. The other solution is to add a percentile configuration to common; only if it is set do we initialize and populate the quantile metrics. Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Attachments: hdfs-5220.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5220: -- Status: Patch Available (was: Open) Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Attachments: hdfs-5220.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5220: -- Status: Open (was: Patch Available) Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Attachments: hdfs-5220.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5685) DistCp will fail to copy with -delete switch
[ https://issues.apache.org/jira/browse/HDFS-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852469#comment-13852469 ] Jimmy Xiang commented on HDFS-5685: --- Do we know why it can't find the file? Is it because the file is already copied by a failed task? DistCp will fail to copy with -delete switch Key: HDFS-5685 URL: https://issues.apache.org/jira/browse/HDFS-5685 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 1.0.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang When using distcp command to copy files with -delete switch, running as user xyz, hadoop distcp -p -i -update -delete hdfs://srchost:port/user hdfs://dsthost:port/user It fails with the following exception: Copy failed: java.io.FileNotFoundException: File does not exist: hdfs://dsthost:port/user/xyz/.stagingdistcp_urjb0g/_distcp_src_files at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:557) at org.apache.hadoop.tools.DistCp$CopyInputFormat.getSplits(DistCp.java:266) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081) at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073) at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353) at org.apache.hadoop.tools.DistCp.copy(DistCp.java:667) at org.apache.hadoop.tools.DistCp.run(DistCp.java:881) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.tools.DistCp.main(DistCp.java:908) -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5685) DistCp will fail to copy with -delete switch
[ https://issues.apache.org/jira/browse/HDFS-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852473#comment-13852473 ] Jimmy Xiang commented on HDFS-5685: --- Looks like it's a Hadoop/MR issue rather than an HDFS issue. DistCp will fail to copy with -delete switch Key: HDFS-5685 URL: https://issues.apache.org/jira/browse/HDFS-5685 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 1.0.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang When using distcp command to copy files with -delete switch, running as user xyz, hadoop distcp -p -i -update -delete hdfs://srchost:port/user hdfs://dsthost:port/user It fails with the following exception: Copy failed: java.io.FileNotFoundException: File does not exist: hdfs://dsthost:port/user/xyz/.stagingdistcp_urjb0g/_distcp_src_files at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:557) at org.apache.hadoop.tools.DistCp$CopyInputFormat.getSplits(DistCp.java:266) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081) at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073) at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353) at org.apache.hadoop.tools.DistCp.copy(DistCp.java:667) at org.apache.hadoop.tools.DistCp.run(DistCp.java:881) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.tools.DistCp.main(DistCp.java:908) -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5220) Expose group resolution time as metric
[ https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852549#comment-13852549 ] Jimmy Xiang commented on HDFS-5220: --- The avg time is 0? Group resolution is just that fast on this box. Let me enhance the test a little. The other test failure is not related. Expose group resolution time as metric -- Key: HDFS-5220 URL: https://issues.apache.org/jira/browse/HDFS-5220 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Attachments: hdfs-5220.patch It would help detect issues with authentication configuration and with overloading an authentication source if the name node exposed the time taken for group resolution as a metric. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
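A sketch of what "enhance the test a little" can look like; assertCounterGt and getMetrics are real helpers from Hadoop's MetricsAsserts, while the metric and record names here are assumptions. On a fast box an average of 0 ms is legitimate, so the assertion should key off the operation count rather than the average time.
{code:java}
// Sketch: assert that group resolution was recorded without assuming
// the average latency is nonzero.
import static org.apache.hadoop.test.MetricsAsserts.assertCounterGt;
import static org.apache.hadoop.test.MetricsAsserts.getMetrics;

public class GroupMetricsCheckSketch {
  public void verifyGroupResolutionRecorded() {
    // The op counter grows even when each resolution takes under 1 ms,
    // so this cannot flake the way an AvgTime > 0 check does.
    assertCounterGt("GetGroupsNumOps", 0L, getMetrics("UgiMetrics"));
  }
}
{code}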
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Status: Open (was: Patch Available) Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Attachment: trunk-5666_v2.patch Andrew, thanks a lot for the review. I fixed the GetImage metric. I also added a GetEdit metric for edit transfers. Instead of adding a test in TestNameNodeMetrics, I added some metrics checks in TestBackupNode#testCheckpointNode so that we don't have to copy the code around (to create some image/edit related activities). How is that? Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch, trunk-5666_v2.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Attachment: (was: trunk-5666_v2.patch) Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13849387#comment-13849387 ] Jimmy Xiang commented on HDFS-5350: --- Sorry, wrong patch. I see. Let me take a look at TestCheckpoint. Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Status: Patch Available (was: Open) Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch, trunk-5350_v3.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Attachment: trunk-5350_v3.patch Attached patch v3; moved the test to TestCheckpoint. It is hard to predict how many times getimage/getedit/putimage are called, so I just verified that the metrics are updated generally. Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch, trunk-5350_v3.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
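The before/after-delta pattern is one way to verify metrics "generally"; getLongCounter and getMetrics are real MetricsAsserts helpers, while the record name and the surrounding test shape are assumptions here. Since the exact number of getimage/getedit/putimage calls is unpredictable, the check only asserts that the counter moved.
{code:java}
// Sketch of an exact-count-free metrics check.
import static org.apache.hadoop.test.MetricsAsserts.getLongCounter;
import static org.apache.hadoop.test.MetricsAsserts.getMetrics;
import static org.junit.Assert.assertTrue;

public class CheckpointMetricsSketch {
  public void verifyImageTransferRecorded() {
    long before = getLongCounter("GetImageNumOps", getMetrics("NameNodeActivity"));
    runCheckpoint(); // stand-in for driving a checkpoint in the real test
    long after = getLongCounter("GetImageNumOps", getMetrics("NameNodeActivity"));
    assertTrue("expected at least one fsimage download", after > before);
  }

  private void runCheckpoint() {
    // elided: start a checkpointing node and wait for a checkpoint
  }
}
{code}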
[jira] [Resolved] (HDFS-5566) HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable
[ https://issues.apache.org/jira/browse/HDFS-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang resolved HDFS-5566. --- Resolution: Invalid Looked into it and found out that it is actually an HBase issue. HDFS is good. HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable -- Key: HDFS-5566 URL: https://issues.apache.org/jira/browse/HDFS-5566 Project: Hadoop HDFS Issue Type: Bug Environment: hadoop-2.2.0 hbase-0.96 Reporter: Henry Hung Assignee: Jimmy Xiang When using hbase-0.96 with hadoop-2.2.0, stopping a master/regionserver node will result in {{Cannot close proxy - is not Closeable or does not provide closeable invocation}}. [Mail Archive|https://drive.google.com/file/d/0B22pkxoqCdvWSGFIaEpfR3lnT2M/edit?usp=sharing] My hadoop-2.2.0 is configured as an HA namenode with QJM; the configuration is like this: {code:xml} <property> <name>dfs.nameservices</name> <value>hadoopdev</value> </property> <property> <name>dfs.ha.namenodes.hadoopdev</name> <value>nn1,nn2</value> </property> <property> <name>dfs.namenode.rpc-address.hadoopdev.nn1</name> <value>fphd9.ctpilot1.com:9000</value> </property> <property> <name>dfs.namenode.http-address.hadoopdev.nn1</name> <value>fphd9.ctpilot1.com:50070</value> </property> <property> <name>dfs.namenode.rpc-address.hadoopdev.nn2</name> <value>fphd10.ctpilot1.com:9000</value> </property> <property> <name>dfs.namenode.http-address.hadoopdev.nn2</name> <value>fphd10.ctpilot1.com:50070</value> </property> <property> <name>dfs.namenode.shared.edits.dir</name> <value>qjournal://fphd8.ctpilot1.com:8485;fphd9.ctpilot1.com:8485;fphd10.ctpilot1.com:8485/hadoopdev</value> </property> <property> <name>dfs.client.failover.proxy.provider.hadoopdev</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property> <property> <name>dfs.ha.fencing.methods</name> <value>shell(/bin/true)</value> </property> <property> <name>dfs.journalnode.edits.dir</name> <value>/data/hadoop/hadoop-data-2/journal</value> </property> <property> <name>dfs.ha.automatic-failover.enabled</name> <value>true</value> </property> <property> <name>ha.zookeeper.quorum</name> <value>fphd1.ctpilot1.com:</value> </property> {code} I traced the code and found out that when stopping the hbase master node, it will try to invoke the method close on the namenode, but the instance created from {{org.apache.hadoop.hdfs.NameNodeProxies.createProxy}} with failoverProxyProviderClass {{org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider}} does not have the Closeable interface. In the Non-HA case, the created instance will be {{org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB}}, which implements Closeable. TL;DR: with hbase connecting to a hadoop HA namenode, when stopping the hbase master or regionserver, it couldn't find the {{close}} method to gracefully close the namenode session. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5668: -- Issue Type: Bug (was: Task) TestBPOfferService.testBPInitErrorHandling fails intermittently --- Key: HDFS-5668 URL: https://issues.apache.org/jira/browse/HDFS-5668 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 3.0.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor The new test introduced in HDFS-4201 is a little flaky. I got failures locally occasionally. It could be related to how we did the mockup. {noformat} Exception in thread DataNode: [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data] heartbeating to 0.0.0.0/0.0.0.0:0 org.mockito.exceptions.misusing.WrongTypeOfReturnValue: SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned by getStorageId() getStorageId() should return String at org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723) 2013-12-13 13:42:03,119 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1 at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5668: -- Component/s: (was: test) namenode TestBPOfferService.testBPInitErrorHandling fails intermittently --- Key: HDFS-5668 URL: https://issues.apache.org/jira/browse/HDFS-5668 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.0.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor The new test introduced in HDFS-4201 is a little flaky. I got failures locally occasionally. It could be related to how we did the mockup. {noformat} Exception in thread DataNode: [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data] heartbeating to 0.0.0.0/0.0.0.0:0 org.mockito.exceptions.misusing.WrongTypeOfReturnValue: SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned by getStorageId() getStorageId() should return String at org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723) 2013-12-13 13:42:03,119 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1 at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5668: -- Status: Patch Available (was: Open) TestBPOfferService.testBPInitErrorHandling fails intermittently --- Key: HDFS-5668 URL: https://issues.apache.org/jira/browse/HDFS-5668 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.0.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5668.patch The new test introduced in HDFS-4201 is a little flaky. I got failures locally occasionally. It could be related to how we did the mockup. {noformat} Exception in thread DataNode: [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data] heartbeating to 0.0.0.0/0.0.0.0:0 org.mockito.exceptions.misusing.WrongTypeOfReturnValue: SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned by getStorageId() getStorageId() should return String at org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723) 2013-12-13 13:42:03,119 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1 at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5668: -- Attachment: trunk-5668.patch It turns out to be a bug: BPOfferService#toString is not synchronized, so it can see a partially initialized dn/bpNSInfo. TestBPOfferService.testBPInitErrorHandling fails intermittently --- Key: HDFS-5668 URL: https://issues.apache.org/jira/browse/HDFS-5668 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.0.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5668.patch The new test introduced in HDFS-4201 is a little flaky. I got failures locally occasionally. It could be related to how we did the mockup. {noformat} Exception in thread DataNode: [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data] heartbeating to 0.0.0.0/0.0.0.0:0 org.mockito.exceptions.misusing.WrongTypeOfReturnValue: SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned by getStorageId() getStorageId() should return String at org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723) 2013-12-13 13:42:03,119 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1 at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
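One way to read the fix described here, sketched minimally; this shows the synchronization point only and is not the attached trunk-5668.patch. If the fields that the service actor initializes are written and read under the same lock, a logging thread can no longer observe them half-built.
{code:java}
// Sketch: toString() reads under the lock that guards initialization.
class BPOfferServiceSketch {
  private NsInfo bpNSInfo; // guarded by "this"

  synchronized void setNamespaceInfo(NsInfo info) {
    this.bpNSInfo = info;
  }

  @Override
  public synchronized String toString() {
    // Never sees a partially initialized bpNSInfo.
    return bpNSInfo == null
        ? "Block pool <registering>"
        : "Block pool " + bpNSInfo.getBlockPoolID();
  }
}

class NsInfo {
  private final String bpid;
  NsInfo(String bpid) { this.bpid = bpid; }
  String getBlockPoolID() { return bpid; }
}
{code}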
[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5668: -- Resolution: Duplicate Status: Resolved (was: Patch Available) TestBPOfferService.testBPInitErrorHandling fails intermittently --- Key: HDFS-5668 URL: https://issues.apache.org/jira/browse/HDFS-5668 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.0.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5668.patch The new test introduced in HDFS-4201 is a little flaky. I got failures locally occasionally. It could be related to how we did the mockup. {noformat} Exception in thread DataNode: [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data] heartbeating to 0.0.0.0/0.0.0.0:0 org.mockito.exceptions.misusing.WrongTypeOfReturnValue: SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned by getStorageId() getStorageId() should return String at org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723) 2013-12-13 13:42:03,119 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1 at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5666: -- Component/s: (was: test) namenode TestBPOfferService#/testBPInitErrorHandling fails intermittently Key: HDFS-5666 URL: https://issues.apache.org/jira/browse/HDFS-5666 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Priority: Minor Intermittent failure on this test: {code} Regression org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling Failing for the past 1 build (Since #5698 ) Took 0.16 sec. Error Message expected:1 but was:2 Stacktrace java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) {code} see https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/ -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Assigned] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang reassigned HDFS-5666: - Assignee: Jimmy Xiang TestBPOfferService#/testBPInitErrorHandling fails intermittently Key: HDFS-5666 URL: https://issues.apache.org/jira/browse/HDFS-5666 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jimmy Xiang Priority: Minor Intermittent failure on this test: {code} Regression org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling Failing for the past 1 build (Since #5698 ) Took 0.16 sec. Error Message expected:1 but was:2 Stacktrace java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) {code} see https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/ -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5666: -- Attachment: trunk-5666.patch It's a bug in BPOfferService#toString, which is not synchronized, so it can read partially initialized info. TestBPOfferService#/testBPInitErrorHandling fails intermittently Key: HDFS-5666 URL: https://issues.apache.org/jira/browse/HDFS-5666 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5666.patch Intermittent failure on this test: {code} Regression org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling Failing for the past 1 build (Since #5698 ) Took 0.16 sec. Error Message expected:1 but was:2 Stacktrace java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) {code} see https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/ -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5666: -- Status: Patch Available (was: Open) TestBPOfferService#/testBPInitErrorHandling fails intermittently Key: HDFS-5666 URL: https://issues.apache.org/jira/browse/HDFS-5666 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5666.patch Intermittent failure on this test: {code} Regression org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling Failing for the past 1 build (Since #5698 ) Took 0.16 sec. Error Message expected:1 but was:2 Stacktrace java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) {code} see https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/ -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848743#comment-13848743 ] Jimmy Xiang commented on HDFS-5666: --- With the patch, I haven't seen the test fail locally in quite a few runs. TestBPOfferService#/testBPInitErrorHandling fails intermittently Key: HDFS-5666 URL: https://issues.apache.org/jira/browse/HDFS-5666 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5666.patch Intermittent failure on this test: {code} Regression org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling Failing for the past 1 build (Since #5698 ) Took 0.16 sec. Error Message expected:1 but was:2 Stacktrace java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) {code} see https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/ -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5666: -- Attachment: trunk-5666_v2.patch TestBPOfferService#/testBPInitErrorHandling fails intermittently Key: HDFS-5666 URL: https://issues.apache.org/jira/browse/HDFS-5666 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5666.patch, trunk-5666_v2.patch Intermittent failure on this test: {code} Regression org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling Failing for the past 1 build (Since #5698 ) Took 0.16 sec. Error Message expected:1 but was:2 Stacktrace java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) {code} see https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/ -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848788#comment-13848788 ] Jimmy Xiang commented on HDFS-5666: --- Agree. Attached patch v2. TestBPOfferService#/testBPInitErrorHandling fails intermittently Key: HDFS-5666 URL: https://issues.apache.org/jira/browse/HDFS-5666 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jimmy Xiang Priority: Minor Attachments: trunk-5666.patch, trunk-5666_v2.patch Intermittent failure on this test: {code} Regression org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling Failing for the past 1 build (Since #5698 ) Took 0.16 sec. Error Message expected:1 but was:2 Stacktrace java.lang.AssertionError: expected:1 but was:2 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) {code} see https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/ -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently
Jimmy Xiang created HDFS-5668: - Summary: TestBPOfferService.testBPInitErrorHandling fails intermittently Key: HDFS-5668 URL: https://issues.apache.org/jira/browse/HDFS-5668 Project: Hadoop HDFS Issue Type: Task Components: test Affects Versions: 3.0.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor The new test introduced in HDFS-4201 is a little flaky. I got failures locally occasionally. It could be related to how we did the mockup. {noformat} Exception in thread DataNode: [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data] heartbeating to 0.0.0.0/0.0.0.0:0 org.mockito.exceptions.misusing.WrongTypeOfReturnValue: SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned by getStorageId() getStorageId() should return String at org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723) 2013-12-13 13:42:03,119 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1 at java.lang.Thread.run(Thread.java:722) {noformat} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5566) HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable
[ https://issues.apache.org/jira/browse/HDFS-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848222#comment-13848222 ] Jimmy Xiang commented on HDFS-5566: --- In NameNodeProxies#createProxy, for the HA case it creates a proxy with the interface ClientProtocol, which is not Closeable, and a RetryInvocationHandler. However, ConfiguredFailoverProxyProvider doesn't have field h, the InvocationHandler, which is the problem. I think this is a valid bug we need to fix. HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable -- Key: HDFS-5566 URL: https://issues.apache.org/jira/browse/HDFS-5566 Project: Hadoop HDFS Issue Type: Bug Environment: hadoop-2.2.0 hbase-0.96 Reporter: Henry Hung When using hbase-0.96 with hadoop-2.2.0, stopping a master/regionserver node will result in {{Cannot close proxy - is not Closeable or does not provide closeable invocation}}. [Mail Archive|https://drive.google.com/file/d/0B22pkxoqCdvWSGFIaEpfR3lnT2M/edit?usp=sharing] My hadoop-2.2.0 is configured as an HA namenode with QJM; the configuration is like this: {code:xml} <property> <name>dfs.nameservices</name> <value>hadoopdev</value> </property> <property> <name>dfs.ha.namenodes.hadoopdev</name> <value>nn1,nn2</value> </property> <property> <name>dfs.namenode.rpc-address.hadoopdev.nn1</name> <value>fphd9.ctpilot1.com:9000</value> </property> <property> <name>dfs.namenode.http-address.hadoopdev.nn1</name> <value>fphd9.ctpilot1.com:50070</value> </property> <property> <name>dfs.namenode.rpc-address.hadoopdev.nn2</name> <value>fphd10.ctpilot1.com:9000</value> </property> <property> <name>dfs.namenode.http-address.hadoopdev.nn2</name> <value>fphd10.ctpilot1.com:50070</value> </property> <property> <name>dfs.namenode.shared.edits.dir</name> <value>qjournal://fphd8.ctpilot1.com:8485;fphd9.ctpilot1.com:8485;fphd10.ctpilot1.com:8485/hadoopdev</value> </property> <property> <name>dfs.client.failover.proxy.provider.hadoopdev</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property> <property> <name>dfs.ha.fencing.methods</name> <value>shell(/bin/true)</value> </property> <property> <name>dfs.journalnode.edits.dir</name> <value>/data/hadoop/hadoop-data-2/journal</value> </property> <property> <name>dfs.ha.automatic-failover.enabled</name> <value>true</value> </property> <property> <name>ha.zookeeper.quorum</name> <value>fphd1.ctpilot1.com:</value> </property> {code} I traced the code and found out that when stopping the hbase master node, it will try to invoke the method close on the namenode, but the instance created from {{org.apache.hadoop.hdfs.NameNodeProxies.createProxy}} with failoverProxyProviderClass {{org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider}} does not have the Closeable interface. In the Non-HA case, the created instance will be {{org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB}}, which implements Closeable. TL;DR: with hbase connecting to a hadoop HA namenode, when stopping the hbase master or regionserver, it couldn't find the {{close}} method to gracefully close the namenode session. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
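A simplified sketch of the close path this comment analyzes, modeled loosely on the decision RPC.stopProxy has to make; ProxyCloser is an illustrative name and this is not Hadoop's actual implementation. A proxy can be closed gracefully only if either the proxy object itself or its InvocationHandler is Closeable, which is exactly what the HA path was missing.
{code:java}
// Sketch of the Closeable check; error text mirrors the message quoted above.
import java.io.Closeable;
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

final class ProxyCloser {
  static void stop(Object proxy) throws IOException {
    if (proxy instanceof Closeable) {
      ((Closeable) proxy).close(); // non-HA translator path
      return;
    }
    if (Proxy.isProxyClass(proxy.getClass())) {
      InvocationHandler h = Proxy.getInvocationHandler(proxy);
      if (h instanceof Closeable) {
        ((Closeable) h).close(); // HA path: handler closes the failover provider
        return;
      }
    }
    throw new IOException(
        "Cannot close proxy - is not Closeable or does not provide closeable invocation");
  }
}
{code}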
[jira] [Reopened] (HDFS-5566) HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable
[ https://issues.apache.org/jira/browse/HDFS-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang reopened HDFS-5566: --- Assignee: Jimmy Xiang HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable -- Key: HDFS-5566 URL: https://issues.apache.org/jira/browse/HDFS-5566 Project: Hadoop HDFS Issue Type: Bug Environment: hadoop-2.2.0 hbase-0.96 Reporter: Henry Hung Assignee: Jimmy Xiang When using hbase-0.96 with hadoop-2.2.0, stopping a master/regionserver node will result in {{Cannot close proxy - is not Closeable or does not provide closeable invocation}}. [Mail Archive|https://drive.google.com/file/d/0B22pkxoqCdvWSGFIaEpfR3lnT2M/edit?usp=sharing] My hadoop-2.2.0 is configured as an HA namenode with QJM; the configuration is like this: {code:xml} <property> <name>dfs.nameservices</name> <value>hadoopdev</value> </property> <property> <name>dfs.ha.namenodes.hadoopdev</name> <value>nn1,nn2</value> </property> <property> <name>dfs.namenode.rpc-address.hadoopdev.nn1</name> <value>fphd9.ctpilot1.com:9000</value> </property> <property> <name>dfs.namenode.http-address.hadoopdev.nn1</name> <value>fphd9.ctpilot1.com:50070</value> </property> <property> <name>dfs.namenode.rpc-address.hadoopdev.nn2</name> <value>fphd10.ctpilot1.com:9000</value> </property> <property> <name>dfs.namenode.http-address.hadoopdev.nn2</name> <value>fphd10.ctpilot1.com:50070</value> </property> <property> <name>dfs.namenode.shared.edits.dir</name> <value>qjournal://fphd8.ctpilot1.com:8485;fphd9.ctpilot1.com:8485;fphd10.ctpilot1.com:8485/hadoopdev</value> </property> <property> <name>dfs.client.failover.proxy.provider.hadoopdev</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property> <property> <name>dfs.ha.fencing.methods</name> <value>shell(/bin/true)</value> </property> <property> <name>dfs.journalnode.edits.dir</name> <value>/data/hadoop/hadoop-data-2/journal</value> </property> <property> <name>dfs.ha.automatic-failover.enabled</name> <value>true</value> </property> <property> <name>ha.zookeeper.quorum</name> <value>fphd1.ctpilot1.com:</value> </property> {code} I traced the code and found out that when stopping the hbase master node, it will try to invoke the method close on the namenode, but the instance created from {{org.apache.hadoop.hdfs.NameNodeProxies.createProxy}} with failoverProxyProviderClass {{org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider}} does not have the Closeable interface. In the Non-HA case, the created instance will be {{org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB}}, which implements Closeable. TL;DR: with hbase connecting to a hadoop HA namenode, when stopping the hbase master or regionserver, it couldn't find the {{close}} method to gracefully close the namenode session. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Priority: Minor (was: Major) Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Attachment: trunk-5350.patch Attached a patch that adds metrics for fsimage download/upload. Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-5350: -- Fix Version/s: 3.0.0 Status: Patch Available (was: Open) Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845805#comment-13845805 ] Jimmy Xiang commented on HDFS-5350: --- I tested the patch on my cluster. Here are the new metrics from the JMX page:
{noformat}
GetImageNumOps : 56,
GetImageAvgTime : 3.75,
PutImageNumOps : 51,
PutImageAvgTime : 80.0
{noformat}
Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang Priority: Minor Fix For: 3.0.0 Attachments: trunk-5350.patch If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4201: -- Status: Patch Available (was: Open) NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch, trunk-4201_v3.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4201: -- Status: Open (was: Patch Available) NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch, trunk-4201_v3.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1381#comment-1381 ] Jimmy Xiang commented on HDFS-4201: --- The javadoc warnings are not related to the patch: https://builds.apache.org/job/PreCommit-HDFS-Build/5685/artifact/trunk/patchprocess/patchJavadocWarnings.txt The audit warning is due to a memory issue:
{noformat}
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 172104 bytes for Arena::Amalloc
# An error report file with more information is saved as:
# /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/trunk/hs_err_pid24616.log
{noformat}
Trying the hadoop-qa again. NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch, trunk-4201_v3.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844678#comment-13844678 ] Jimmy Xiang commented on HDFS-4201: --- The test failure is not related to the patch: {{Problem binding to [0.0.0.0:50010] java.net.BindException}}. It works fine locally and in the previous build. NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch, trunk-4201_v3.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Assigned] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang reassigned HDFS-5350: - Assignee: Jimmy Xiang (was: Andrew Wang) Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5350) Name Node should report fsimage transfer time as a metric
[ https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13844884#comment-13844884 ] Jimmy Xiang commented on HDFS-5350: --- Instead of some sliding-window metrics, I will add two MutableRate metrics for fsimage upload and download latency. From this information, we can also tell whether fsimage transfer is normal. Name Node should report fsimage transfer time as a metric - Key: HDFS-5350 URL: https://issues.apache.org/jira/browse/HDFS-5350 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Rob Weltman Assignee: Jimmy Xiang If the (Secondary) Name Node reported fsimage transfer times (perhaps the last ten of them), monitoring tools could detect slowdowns that might jeopardize cluster stability. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
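A rough sketch of what two such metrics could look like with the metrics2 API; the class and registration names below are illustrative assumptions rather than taken from the patch, but a {{MutableRate}} is what surfaces in JMX as the paired *NumOps/*AvgTime values shown earlier in this digest.
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableRate;

// Hypothetical metrics source: each MutableRate tracks an op count and an
// average time, exported as <Name>NumOps and <Name>AvgTime.
@Metrics(name = "ImageTransfer", context = "dfs")
public class ImageTransferMetrics {
  @Metric("fsimage download time") MutableRate getImage;
  @Metric("fsimage upload time") MutableRate putImage;

  public static ImageTransferMetrics create() {
    // Registering the annotated source instantiates the @Metric fields.
    return DefaultMetricsSystem.instance().register(
        "ImageTransfer", "fsimage transfer metrics", new ImageTransferMetrics());
  }

  public void addGetImage(long elapsedMillis) { getImage.add(elapsedMillis); }
  public void addPutImage(long elapsedMillis) { putImage.add(elapsedMillis); }
}
{code}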
[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843458#comment-13843458 ] Jimmy Xiang commented on HDFS-4201: --- That's another solution I considered. With try+finally, we need to catch all known and unknown exceptions thrown by initBlockPool, then re-throw them, which may not look very good. Is there any known initialization patch change coming soon? NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843622#comment-13843622 ] Jimmy Xiang commented on HDFS-4201: --- Sure, I will do as suggested so that we can minimize the changes. Thanks. NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4201: -- Attachment: trunk-4201_v3.patch Attached v3 that isolated the changes as suggested. NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch, trunk-4201_v3.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4201: -- Status: Open (was: Patch Available) Looking into the test failures. NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4201: -- Status: Patch Available (was: Open) NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HDFS-4201: -- Attachment: trunk-4201_v2.patch Fixed the test failures. Also enhanced the fix a little so that we register the block pool after the datanode initialization is done. NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Fix For: 3.0.0 Attachments: trunk-4201.patch, trunk-4201_v2.patch Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
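An illustrative sketch of the ordering this change aims for; the names below are hypothetical stand-ins, not the actual DataNode/BPServiceActor methods, and are shown only to make the shape of the fix concrete.
{code:java}
import java.io.IOException;

// Register the block pool (which starts heartbeats) only after the
// datanode-side initialization has completed, so sendHeartBeat can never
// observe a half-initialized datanode or a null dataset.
final class BlockPoolStartupSketch {
  interface DataNodeInit {
    void initBlockPool() throws IOException;     // sets up storage and the FSDataset
    void registerBlockPool() throws IOException; // NN registration; heartbeats follow
  }

  static void startBlockPool(DataNodeInit dn) throws IOException {
    dn.initBlockPool();     // if this throws, we never register...
    dn.registerBlockPool(); // ...so no heartbeat runs against a null dataset
  }
}
{code}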
[jira] [Assigned] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat
[ https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang reassigned HDFS-4201: - Assignee: Jimmy Xiang NPE in BPServiceActor#sendHeartBeat --- Key: HDFS-4201 URL: https://issues.apache.org/jira/browse/HDFS-4201 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Eli Collins Assignee: Jimmy Xiang Priority: Critical Saw the following NPE in a log. Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not {{bpRegistration}}) due to a configuration or local directory failure. {code} 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 30 msec BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; heartBeatInterval=3000 2012-09-25 04:33:20,782 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id DS-1031100678-11.164.162.251-5010-1341933415989) service to svsrs00127/11.164.162.226:8020 java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.1#6144)