[jira] [Updated] (HDFS-11272) Missing whitespace in the message of TestFSDirAttrOp.java

2016-12-22 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11272:
---
 Assignee: Jimmy Xiang
Affects Version/s: 2.8.0
   Status: Patch Available  (was: Open)

> Missing whitespace in the message of TestFSDirAttrOp.java
> -
>
> Key: HDFS-11272
> URL: https://issues.apache.org/jira/browse/HDFS-11272
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Akira Ajisaka
>Assignee: Jimmy Xiang
>Priority: Trivial
>  Labels: newbie
> Attachments: hdfs-11272.patch
>
>
> {code:title=TestFSDirAttrOp.java}
> assertFalse("SetTimes should not update access time"
>   + "because it's within the last precision interval",
> {code}
> A whitespace is missing between "time" and "because".
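The bug is plain Java string-literal concatenation, which inserts no implicit separator between adjacent literals. A minimal illustration (class and method names are hypothetical, not the actual patch):

```java
// Adjacent string literals joined with '+' concatenate with no separator,
// so the broken message reads "...access timebecause...". The fix is a
// trailing space on the first literal.
public class MessageConcat {
    static String broken() {
        return "SetTimes should not update access time"
            + "because it's within the last precision interval";
    }

    static String fixed() {
        return "SetTimes should not update access time "
            + "because it's within the last precision interval";
    }
}
```

Calling `broken()` yields the fused words "timebecause", while `fixed()` keeps the word break.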



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11272) Missing whitespace in the message of TestFSDirAttrOp.java

2016-12-22 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11272:
---
Attachment: hdfs-11272.patch

> Missing whitespace in the message of TestFSDirAttrOp.java
> -
>
> Key: HDFS-11272
> URL: https://issues.apache.org/jira/browse/HDFS-11272
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Akira Ajisaka
>Priority: Trivial
>  Labels: newbie
> Attachments: hdfs-11272.patch
>
>
> {code:title=TestFSDirAttrOp.java}
> assertFalse("SetTimes should not update access time"
>   + "because it's within the last precision interval",
> {code}
> A whitespace is missing between time and because.






[jira] [Commented] (HDFS-11258) File mtime change could not save to editlog

2016-12-22 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770512#comment-15770512
 ] 

Jimmy Xiang commented on HDFS-11258:


The addendum is good for both branch-2 and branch-2.8. 

> File mtime change could not save to editlog
> ---
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Critical
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: hdfs-11258-addendum-branch2.patch, hdfs-11258.1.patch, 
> hdfs-11258.2.patch, hdfs-11258.3.patch, hdfs-11258.4.patch
>
>
> When both mtime and atime are changed, and atime is not beyond the precision 
> limit, the mtime change is not saved to edit logs.
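A minimal sketch of the failure mode described above (all names are hypothetical; the real logic lives in the NameNode's set-times path, and this is not the actual patch): the access-time precision check returns early, so a simultaneous mtime change is never marked for the edit log.

```java
// Sketch of the reported control flow (hypothetical names): the atime
// precision check short-circuits the whole operation, so a concurrent
// mtime change is silently dropped instead of being logged.
public class SetTimesSketch {
    static final long PRECISION = 3_600_000L; // e.g. 1h access-time precision
    long mtime;
    long atime;

    // Buggy: bails out on the atime check before applying the mtime change.
    // The caller writes an edit-log entry only when this returns true.
    boolean buggySetTimes(long newMtime, long newAtime) {
        if (newAtime - atime < PRECISION) {
            return false;          // early return also discards the mtime update
        }
        mtime = newMtime;
        atime = newAtime;
        return true;
    }

    // Fixed: apply and report the mtime change independently of the
    // access-time precision check.
    boolean fixedSetTimes(long newMtime, long newAtime) {
        boolean changed = false;
        if (newMtime != mtime) {
            mtime = newMtime;
            changed = true;
        }
        if (newAtime - atime >= PRECISION) {
            atime = newAtime;
            changed = true;
        }
        return changed;
    }
}
```

With both times changed and the atime delta under the precision, the buggy path returns false and loses the mtime; the fixed path applies the mtime and returns true so it reaches the edit log.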






[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog

2016-12-22 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11258:
---
Attachment: hdfs-11258-addendum-branch2.patch

Attached an addendum for branch-2 to fix compilation.

> File mtime change could not save to editlog
> ---
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Critical
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: hdfs-11258-addendum-branch2.patch, hdfs-11258.1.patch, 
> hdfs-11258.2.patch, hdfs-11258.3.patch, hdfs-11258.4.patch
>
>
> When both mtime and atime are changed, and atime is not beyond the precision 
> limit, the mtime change is not saved to edit logs.






[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog

2016-12-20 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11258:
---
Attachment: hdfs-11258.4.patch

> File mtime change could not save to editlog
> ---
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Critical
> Attachments: hdfs-11258.1.patch, hdfs-11258.2.patch, 
> hdfs-11258.3.patch, hdfs-11258.4.patch
>
>
> When both mtime and atime are changed, and atime is not beyond the precision 
> limit, the mtime change is not saved to edit logs.






[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog

2016-12-20 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11258:
---
Attachment: hdfs-11258.3.patch

Thanks [~wheat9] for the review. Fixed the checkstyle warnings.

> File mtime change could not save to editlog
> ---
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Critical
> Attachments: hdfs-11258.1.patch, hdfs-11258.2.patch, 
> hdfs-11258.3.patch
>
>
> When both mtime and atime are changed, and atime is not beyond the precision 
> limit, the mtime change is not saved to edit logs.






[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog

2016-12-18 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11258:
---
Attachment: hdfs-11258.2.patch

Patch v2, added a unit test.

> File mtime change could not save to editlog
> ---
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Minor
> Attachments: hdfs-11258.1.patch, hdfs-11258.2.patch
>
>
> When both mtime and atime are changed, and atime is not beyond the precision 
> limit, the mtime change is not saved to edit logs.






[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog

2016-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11258:
---
Status: Patch Available  (was: Open)

> File mtime change could not save to editlog
> ---
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Minor
> Attachments: hdfs-11258.1.patch
>
>
> When both mtime and atime are changed, and atime is not beyond the precision 
> limit, the mtime change is not saved to edit logs.






[jira] [Updated] (HDFS-11258) File mtime change could not save to editlog

2016-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-11258:
---
Attachment: hdfs-11258.1.patch

> File mtime change could not save to editlog
> ---
>
> Key: HDFS-11258
> URL: https://issues.apache.org/jira/browse/HDFS-11258
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Minor
> Attachments: hdfs-11258.1.patch
>
>
> When both mtime and atime are changed, and atime is not beyond the precision 
> limit, the mtime change is not saved to edit logs.






[jira] [Created] (HDFS-11258) File mtime change could not save to editlog

2016-12-16 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HDFS-11258:
--

 Summary: File mtime change could not save to editlog
 Key: HDFS-11258
 URL: https://issues.apache.org/jira/browse/HDFS-11258
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor


When both mtime and atime are changed, and atime is not beyond the precision 
limit, the mtime change is not saved to edit logs.






[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-09-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--
Assignee: (was: Jimmy Xiang)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it to 
 stop using a disk an operator has designated 'bad'? This would be like option 
 #2 above minus the need to stop and restart the datanode. Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after it's been replaced.





[jira] [Updated] (HDFS-4284) BlockReaderLocal not notified of failed disks

2014-09-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4284:
--
Assignee: (was: Jimmy Xiang)

 BlockReaderLocal not notified of failed disks
 -

 Key: HDFS-4284
 URL: https://issues.apache.org/jira/browse/HDFS-4284
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.0.2-alpha
Reporter: Andy Isaacson

 When a DN marks a disk as bad, it stops using replicas on that disk.
 However a long-running {{BlockReaderLocal}} instance will continue to access 
 replicas on the failing disk.
 Somehow we should let the in-client BlockReaderLocal know that a disk has 
 been marked as bad so that it can stop reading from the bad disk.
 From HDFS-4239:
 bq. To rephrase that, a long running BlockReaderLocal will ride over local DN 
 restarts and disk ejections. We had to drain the RS of all its regions in 
 order to stop it from using the bad disk.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-09-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--
Status: Open  (was: Patch Available)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it to 
 stop using a disk an operator has designated 'bad'? This would be like option 
 #2 above minus the need to stop and restart the datanode. Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after it's been replaced.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-09-05 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123715#comment-14123715
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Sure. Assigned it to you.

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Yongjun Zhang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it to 
 stop using a disk an operator has designated 'bad'? This would be like option 
 #2 above minus the need to stop and restart the datanode. Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-09-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--
Assignee: Yongjun Zhang

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Yongjun Zhang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it to 
 stop using a disk an operator has designated 'bad'? This would be like option 
 #2 above minus the need to stop and restart the datanode. Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after it's been replaced.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-10 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897069#comment-13897069
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Ping. Can anyone take a look at patch v4? Thanks.

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it to 
 stop using a disk an operator has designated 'bad'? This would be like option 
 #2 above minus the need to stop and restart the datanode. Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after it's been replaced.





[jira] [Commented] (HDFS-5882) TestAuditLogs is flaky

2014-02-05 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892424#comment-13892424
 ] 

Jimmy Xiang commented on HDFS-5882:
---

Ok, let me take a look at that.

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky

2014-02-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5882:
--

Status: Open  (was: Patch Available)

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky

2014-02-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5882:
--

Status: Patch Available  (was: Open)

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch, hdfs-5882_v2.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky

2014-02-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5882:
--

Attachment: hdfs-5882_v2.patch

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch, hdfs-5882_v2.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Commented] (HDFS-5882) TestAuditLogs is flaky

2014-02-05 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892544#comment-13892544
 ] 

Jimmy Xiang commented on HDFS-5882:
---

Posted v2 that force-flushes the logs by closing the appenders. The flakiness 
is mostly due to the async appender; closing it flushes the buffered logs.
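The flush-by-close idea can be illustrated with the JDK's own logging as a stand-in for the log4j async appender the actual test uses (class name and log message here are hypothetical): a buffered handler may hold records in memory, so a test that reads the output too early sees nothing, while closing the handler flushes everything.

```java
import java.io.ByteArrayOutputStream;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

// Stand-in demo (java.util.logging, not the log4j AsyncAppender from the
// actual test): a plain StreamHandler buffers output in its writer, so a
// published record typically reaches the sink only on flush() or close().
public class FlushOnClose {
    public static String drainAfterClose() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        StreamHandler handler = new StreamHandler(sink, new SimpleFormatter());
        handler.publish(new LogRecord(Level.INFO, "audit allowed: /some/path"));
        // Before close, the bytes usually sit in the handler's buffer and a
        // premature read of the sink would miss them (the flaky scenario).
        handler.close(); // close() flushes buffered records to the sink
        return sink.toString();
    }
}
```

After `close()`, the sink reliably contains the formatted record, which is the property the v2 patch leans on before asserting on the audit log file.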

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch, hdfs-5882_v2.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at 
 org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-04 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Open  (was: Patch Available)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it to 
 stop using a disk an operator has designated 'bad'? This would be like option 
 #2 above minus the need to stop and restart the datanode. Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after it's been replaced.





[jira] [Created] (HDFS-5882) TestAuditLogs is flaky

2014-02-04 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HDFS-5882:
-

 Summary: TestAuditLogs is flaky
 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Priority: Minor


TestAuditLogs fails sometimes:

{noformat}
Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec  
FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs)  
Time elapsed: 2.085 sec   FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:92)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertNotNull(Assert.java:526)
at org.junit.Assert.assertNotNull(Assert.java:537)
at 
org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
at 
org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
at 
org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
{noformat}





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-04 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Attachment: hdfs-4239_v5.patch

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-04 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Patch Available  (was: Open)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Created] (HDFS-5883) TestZKPermissionsWatcher.testPermissionsWatcher fails sometimes

2014-02-04 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HDFS-5883:
-

 Summary: TestZKPermissionsWatcher.testPermissionsWatcher fails 
sometimes
 Key: HDFS-5883
 URL: https://issues.apache.org/jira/browse/HDFS-5883
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Trivial


It looks like sleeping 100 ms is not enough for the permission change to 
propagate to the other watchers. Will increase the sleep time a little.
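
A common alternative to bumping a fixed sleep is to poll for the expected 
state with a bounded timeout, so the test waits only as long as propagation 
actually takes. A minimal sketch of that pattern (the helper below is 
illustrative, not part of the Hadoop test code):

```java
import java.util.function.BooleanSupplier;

public class AwaitCondition {
    // Poll `cond` until it holds or `timeoutMs` elapses, checking every
    // `intervalMs`. Returns true as soon as the condition is observed.
    public static boolean await(BooleanSupplier cond, long timeoutMs, long intervalMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!cond.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "permission change has propagated to the other watcher":
        // here the condition simply becomes true after ~200 ms.
        boolean seen = await(() -> System.currentTimeMillis() - start >= 200, 2000, 20);
        System.out.println(seen);  // prints true
    }
}
```

The test would then assert on the result of `await(...)` instead of sleeping 
a hard-coded 100 ms, which tolerates slow propagation without slowing down 
the common case.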





[jira] [Assigned] (HDFS-5882) TestAuditLogs is flaky

2014-02-04 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HDFS-5882:
-

Assignee: Jimmy Xiang

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor

 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky

2014-02-04 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5882:
--

Attachment: hdfs-5882.patch

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Updated] (HDFS-5882) TestAuditLogs is flaky

2014-02-04 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5882:
--

Status: Patch Available  (was: Open)

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Commented] (HDFS-5882) TestAuditLogs is flaky

2014-02-04 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13891531#comment-13891531
 ] 

Jimmy Xiang commented on HDFS-5882:
---

I was thinking of force-flushing the logger to disk too, but there isn't an 
easy way. With the current patch, I no longer see the problem locally.
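
Since there is no easy way to force-flush the audit logger, another common 
workaround is to re-read the log with a bounded retry loop until the expected 
entry appears, rather than asserting on a single read. A sketch of the idea 
(the file handling and the `cmd=stat` entry text are illustrative, not the 
actual test code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LogPoller {
    // Re-read a log file until a line containing `needle` shows up, or time
    // out. This mimics retrying verifyAuditLogs() instead of asserting on a
    // single read that may race with the logger's buffered writes.
    public static boolean waitForLine(Path log, String needle, long timeoutMs)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (Files.exists(log)) {
                for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
                    if (line.contains(needle)) {
                        return true;
                    }
                }
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(50);  // back off briefly before re-reading
        }
    }

    public static void main(String[] args) throws Exception {
        Path log = Files.createTempFile("audit", ".log");
        // Simulate a logger whose entry only reaches the file after a delay.
        new Thread(() -> {
            try {
                Thread.sleep(200);
                Files.write(log, "cmd=stat allowed=true\n".getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.APPEND);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).start();
        System.out.println(waitForLine(log, "cmd=stat", 2000));  // prints true
    }
}
```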

 TestAuditLogs is flaky
 --

 Key: HDFS-5882
 URL: https://issues.apache.org/jira/browse/HDFS-5882
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hdfs-5882.patch


 TestAuditLogs fails sometimes:
 {noformat}
 Running org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 37.913 sec 
  FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestAuditLogs
 testAuditAllowedStat[1](org.apache.hadoop.hdfs.server.namenode.TestAuditLogs) 
  Time elapsed: 2.085 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:92)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at org.junit.Assert.assertNotNull(Assert.java:526)
   at org.junit.Assert.assertNotNull(Assert.java:537)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogsRepeat(TestAuditLogs.java:312)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.verifyAuditLogs(TestAuditLogs.java:295)
   at org.apache.hadoop.hdfs.server.namenode.TestAuditLogs.testAuditAllowedStat(TestAuditLogs.java:163)
 {noformat}





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-03 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Open  (was: Patch Available)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-03 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889841#comment-13889841
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Good point. Let me handle the access control in the next patch. As for the 
blacklisted volume IDs, can we handle them in a separate issue?

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-03 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Patch Available  (was: Open)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-03 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Attachment: hdfs-4239_v4.patch

Attached v4, which adds access control when security is enabled.

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-30 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13887291#comment-13887291
 ] 

Jimmy Xiang commented on HDFS-4239:
---

This test failure is not related.

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-29 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Open  (was: Patch Available)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-29 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Attachment: hdfs-4239_v3.patch

Attached v3, which fixes the test failures.

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-29 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Patch Available  (was: Open)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-28 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13884897#comment-13884897
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Cool, I agree. Attached v2, which releases all references to the volume marked 
down. In my test, I don't see any open file descriptors pointing to the volume 
that was marked down.

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-28 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Attachment: hdfs-4239_v2.patch

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-28 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Patch Available  (was: Open)

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Assigned] (HDFS-4284) BlockReaderLocal not notified of failed disks

2014-01-28 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HDFS-4284:
-

Assignee: Jimmy Xiang

 BlockReaderLocal not notified of failed disks
 -

 Key: HDFS-4284
 URL: https://issues.apache.org/jira/browse/HDFS-4284
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.0.2-alpha
Reporter: Andy Isaacson
Assignee: Jimmy Xiang

 When a DN marks a disk as bad, it stops using replicas on that disk.
 However a long-running {{BlockReaderLocal}} instance will continue to access 
 replicas on the failing disk.
 Somehow we should let the in-client BlockReaderLocal know that a disk has 
 been marked as bad so that it can stop reading from the bad disk.
 From HDFS-4239:
 bq. To rephrase that, a long running BlockReaderLocal will ride over local DN 
 restarts and disk ejections. We had to drain the RS of all its regions in 
 order to stop it from using the bad disk.
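
One way to give the in-client {{BlockReaderLocal}} that notification is a 
revocable token shared between the datanode and the local reader: the DN 
revokes it when it ejects the disk, and the reader checks it before each 
read. This is a hypothetical sketch of the idea, not the actual HDFS API:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a shared token the datanode can revoke when it ejects
// a disk, so a long-running local reader stops using replicas on that disk.
public class VolumeToken {
    private final AtomicBoolean valid = new AtomicBoolean(true);

    // The datanode calls this when it marks the volume's disk as bad.
    public void revoke() {
        valid.set(false);
    }

    // A local reader calls this before each read; once the token is revoked,
    // the next read fails with an IOException, letting the client fall back
    // to a remote read instead of touching the bad disk.
    public void checkValid() throws IOException {
        if (!valid.get()) {
            throw new IOException("volume marked bad; switch to remote read");
        }
    }
}
```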





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-27 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13883728#comment-13883728
 ] 

Jimmy Xiang commented on HDFS-4239:
---

We can release the lock after the volume is marked down. No new blocks will be 
allocated to this volume. But what about the blocks on this volume that are 
still being written? A write could take forever, for example to a rarely 
updated HLog file. I was thinking of failing the write pipeline so that the 
client can set up another one. Any problem with that?

 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Jimmy Xiang
 Attachments: hdfs-4239.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the entire datanode. If the datanode is carrying 6 or 12 
 disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
 rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low-latency serving, e.g. hosting an 
 HBase cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
 can't unmount the disk while it is in use). The latter is better in that 
 only the bad disk's data is rereplicated, not all of the datanode's data.
 Is it possible to do better, say, to send the datanode a signal telling it 
 to stop using a disk an operator has designated 'bad'? This would be like 
 option #2 above minus the need to stop and restart the datanode. Ideally 
 the disk would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a 
 disk after it's been replaced.





[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-17 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Open  (was: Patch Available)



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-17 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875538#comment-13875538
 ] 

Jimmy Xiang commented on HDFS-4239:
---

File 'in_use.lock' is still there after the volume is marked down.  Let me take 
another look.



[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Attachment: hdfs-4239.patch

Attached a patch for trunk. It applies to branch-2 as well.

(v6.1.5#6160)


[jira] [Updated] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4239:
--

Status: Patch Available  (was: Open)



[jira] [Assigned] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-13 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HDFS-4239:
-

Assignee: Jimmy Xiang



[jira] [Updated] (HDFS-5220) Expose group resolution time as metric

2014-01-06 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5220:
--

Attachment: 2.4-5220.patch

Thanks a lot, [~cmccabe]. I attached 2.4-5220.patch for 2.4.

 Expose group resolution time as metric
 --

 Key: HDFS-5220
 URL: https://issues.apache.org/jira/browse/HDFS-5220
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.4.0
Reporter: Rob Weltman
Assignee: Jimmy Xiang
 Fix For: 3.0.0

 Attachments: 2.4-5220.patch, hdfs-5220.patch, hdfs-5220_v2.patch


 It would help detect issues with authentication configuration and with 
 overloading an authentication source if the name node exposed the time taken 
 for group resolution as a metric.





[jira] [Commented] (HDFS-5220) Expose group resolution time as metric

2014-01-06 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863451#comment-13863451
 ] 

Jimmy Xiang commented on HDFS-5220:
---

The test is a little flaky. It depends on the execution order. Let me fix it.



[jira] [Updated] (HDFS-5220) Expose group resolution time as metric

2014-01-06 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5220:
--

Attachment: hdfs-5220.addendum
2.4-5220.addendum

If TestUserGroupInformation#testGetServerSideGroups() runs first, 
TestUserGroupInformation#testLogin will fail. Attached two addenda that adjust 
the group verification slightly.
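The order dependence described above can be illustrated in miniature. This is a hedged Python sketch, not the actual Hadoop test code: the function and group names are invented. An exact-match assertion breaks once an earlier test has left extra state in a shared cache, while a subset check passes in either execution order.

```python
# Hypothetical illustration of the order-dependent assertion problem;
# names are invented, not taken from TestUserGroupInformation itself.

cached_groups = []  # shared state, like a JVM-wide group cache

def resolve_groups(user):
    # Ensure the base groups are present, mimicking real resolution.
    for g in ("staff", "users"):
        if g not in cached_groups:
            cached_groups.append(g)
    return list(cached_groups)

def test_server_side_groups():
    # Running this first leaves an extra group in the shared cache.
    cached_groups.append("admins")
    assert "admins" in resolve_groups("alice")

def login_check_fragile():
    # Exact-match verification: breaks if test_server_side_groups ran first.
    return resolve_groups("alice") == ["staff", "users"]

def login_check_robust():
    # Subset verification: passes regardless of execution order.
    return {"staff", "users"}.issubset(resolve_groups("alice"))
```

Loosening the verification from an exact list to a subset check is one way an addendum like this can make the two tests independent of ordering.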



[jira] [Commented] (HDFS-5220) Expose group resolution time as metric

2014-01-06 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863500#comment-13863500
 ] 

Jimmy Xiang commented on HDFS-5220:
---

Let's use a separate jira instead: HADOOP-10207.



[jira] [Updated] (HDFS-5220) Expose group resolution time as metric

2013-12-19 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5220:
--

Attachment: hdfs-5220_v2.patch

Attached v2, which initializes and populates the quantile metrics in 
UserGroupInformation.setConfiguration so that we don't change the 
UserGroupInformation interface.




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HDFS-5220) Expose group resolution time as metric

2013-12-19 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5220:
--

Status: Patch Available  (was: Open)



[jira] [Assigned] (HDFS-5220) Expose group resolution time as metric

2013-12-18 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HDFS-5220:
-

Assignee: Jimmy Xiang



[jira] [Updated] (HDFS-5220) Expose group resolution time as metric

2013-12-18 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5220:
--

Attachment: hdfs-5220.patch

Added the first version of the patch.  One drawback is that the quantile info 
is in a different metric from the rate info. The other option is to add a 
percentile configuration to common; only if it is set would we initialize and 
populate the quantile metrics.
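The rate-vs-quantile split mentioned above can be sketched as follows. This is a hypothetical Python illustration, not the Hadoop metrics2 API the patch actually uses: a rate metric keeps only a count and running total (so it can report a mean), while quantile metrics retain samples so they can answer percentile queries.

```python
# Hypothetical sketch of a rate metric vs. quantile metrics; the real
# patch uses Hadoop's metrics2 library, not this code.

class RateMetric:
    """Tracks only the number of operations and the mean latency."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, latency_ms):
        self.count += 1
        self.total += latency_ms

    def mean(self):
        return self.total / self.count if self.count else 0.0

class QuantileMetric:
    """Retains samples so percentiles (p50, p99, ...) can be reported."""
    def __init__(self):
        self.samples = []

    def add(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
        return ordered[idx]

rate, quant = RateMetric(), QuantileMetric()
for ms in [1, 2, 3, 4, 100]:   # one slow group resolution among fast ones
    rate.add(ms)
    quant.add(ms)
# The mean hides the outlier; the high percentile exposes it.
```

This is why exposing only a rate can hide an occasionally overloaded authentication source, and why the quantile metrics are worth the extra configuration.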



[jira] [Updated] (HDFS-5220) Expose group resolution time as metric

2013-12-18 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5220:
--

Status: Patch Available  (was: Open)



[jira] [Updated] (HDFS-5220) Expose group resolution time as metric

2013-12-18 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5220:
--

Status: Open  (was: Patch Available)



[jira] [Commented] (HDFS-5685) DistCp will fail to copy with -delete switch

2013-12-18 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852469#comment-13852469
 ] 

Jimmy Xiang commented on HDFS-5685:
---

Do we know why it can't find the file? Is it because the file was already copied 
by a failed task?

 DistCp will fail to copy with -delete switch
 

 Key: HDFS-5685
 URL: https://issues.apache.org/jira/browse/HDFS-5685
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 1.0.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang

 When using distcp command to copy files with -delete switch, running as user 
 xyz,
 hadoop distcp -p -i -update  -delete hdfs://srchost:port/user 
 hdfs://dsthost:port/user
 It fails with the following exception:
 Copy failed: java.io.FileNotFoundException: File does not exist: 
 hdfs://dsthost:port/user/xyz/.stagingdistcp_urjb0g/_distcp_src_files
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:557)
 at 
 org.apache.hadoop.tools.DistCp$CopyInputFormat.getSplits(DistCp.java:266)
 at 
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
 at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
 at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
 at org.apache.hadoop.tools.DistCp.copy(DistCp.java:667)
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)





[jira] [Commented] (HDFS-5685) DistCp will fail to copy with -delete switch

2013-12-18 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852473#comment-13852473
 ] 

Jimmy Xiang commented on HDFS-5685:
---

Looks like it's a Hadoop/MR issue rather than an HDFS one.



[jira] [Commented] (HDFS-5220) Expose group resolution time as metric

2013-12-18 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13852549#comment-13852549
 ] 

Jimmy Xiang commented on HDFS-5220:
---

The avg time is 0? Group lookup is just that fast on this box. Let me enhance 
the test a little. The other test failure is unrelated.
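A hedged sketch of why the average can read as 0 (illustrative Python, not the actual Hadoop test): if each lookup completes in well under the timer's reporting unit, the per-call measurement rounds down to zero milliseconds, and so does the average. Adding an artificial delay, as an enhanced test might, makes the metric visibly non-zero.

```python
import time

def timed_ms(fn):
    # Measure wall-clock time in whole milliseconds, the way a
    # millisecond-granularity metric would report it.
    start = time.time()
    fn()
    return int((time.time() - start) * 1000)

fast_lookup = lambda: None               # group already cached: near-instant
slow_lookup = lambda: time.sleep(0.02)   # artificial 20 ms delay for the test

fast = timed_ms(fast_lookup)   # rounds down to 0 ms
slow = timed_ms(slow_lookup)   # comfortably non-zero
```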



[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Status: Open  (was: Patch Available)

 Name Node should report fsimage transfer time as a metric
 -

 Key: HDFS-5350
 URL: https://issues.apache.org/jira/browse/HDFS-5350
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Rob Weltman
Assignee: Jimmy Xiang
Priority: Minor
 Fix For: 3.0.0

 Attachments: trunk-5350.patch


 If the (Secondary) Name Node reported fsimage transfer times (perhaps the 
 last ten of them), monitoring tools could detect slowdowns that might 
 jeopardize cluster stability.





[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Attachment: trunk-5666_v2.patch

Andrew, thanks a lot for the review.  I fixed the GetImage metric. I also added 
a GetEdit metric for edit transfers. Instead of adding a test in 
TestNameNodeMetrics, I added some metrics checks to 
TestBackupNode#testCheckpointNode so that we don't have to copy the code around 
(to create some image/edit-related activity). How is that?



[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Attachment: (was: trunk-5666_v2.patch)



[jira] [Commented] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-16 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13849387#comment-13849387
 ] 

Jimmy Xiang commented on HDFS-5350:
---

Sorry, wrong patch. I see. Let me take a look at TestCheckpoint.



[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Status: Patch Available  (was: Open)



[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Attachment: trunk-5350_v3.patch

Attached patch v3, which moves the test to TestCheckpoint. It is hard to predict 
how many times getimage/getedit/putimage are called, so I just verified that the 
metrics are updated at all.
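The verification style described above, asserting that a metric moved rather than matching an exact call count, can be sketched like this (hypothetical Python, not the actual TestCheckpoint code; the counter and function names are invented):

```python
import random

class Counter:
    """Minimal stand-in for a mutable metrics counter."""
    def __init__(self):
        self.value = 0
    def incr(self):
        self.value += 1

get_image_count = Counter()

def run_checkpoint():
    # A checkpoint may fetch the image a varying number of times,
    # depending on retries and timing; the exact count is unpredictable.
    for _ in range(random.randint(1, 3)):
        get_image_count.incr()

before = get_image_count.value
run_checkpoint()
after = get_image_count.value

# Brittle: `assert after - before == 2` depends on timing/retries.
# Robust: just check that the metric was updated at all.
assert after > before
```

Asserting on the delta's sign rather than its exact value keeps the test stable across timing variations.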



[jira] [Resolved] (HDFS-5566) HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable

2013-12-16 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang resolved HDFS-5566.
---

Resolution: Invalid

Looked into it and found out that it is actually an HBase issue. HDFS is fine.

 HA namenode with QJM created from 
 org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider 
 should implement Closeable
 --

 Key: HDFS-5566
 URL: https://issues.apache.org/jira/browse/HDFS-5566
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: hadoop-2.2.0
 hbase-0.96
Reporter: Henry Hung
Assignee: Jimmy Xiang

 When using hbase-0.96 with hadoop-2.2.0, stopping master/regionserver node 
 will result in {{Cannot close proxy - is not Closeable or does not provide 
 closeable invocation}}.
 [Mail 
 Archive|https://drive.google.com/file/d/0B22pkxoqCdvWSGFIaEpfR3lnT2M/edit?usp=sharing]
 My hadoop-2.2.0 configured as HA namenode with QJM, the configuration is like 
 this:
 {code:xml}
   <property>
     <name>dfs.nameservices</name>
     <value>hadoopdev</value>
   </property>
   <property>
     <name>dfs.ha.namenodes.hadoopdev</name>
     <value>nn1,nn2</value>
   </property>
   <property>
     <name>dfs.namenode.rpc-address.hadoopdev.nn1</name>
     <value>fphd9.ctpilot1.com:9000</value>
   </property>
   <property>
     <name>dfs.namenode.http-address.hadoopdev.nn1</name>
     <value>fphd9.ctpilot1.com:50070</value>
   </property>
   <property>
     <name>dfs.namenode.rpc-address.hadoopdev.nn2</name>
     <value>fphd10.ctpilot1.com:9000</value>
   </property>
   <property>
     <name>dfs.namenode.http-address.hadoopdev.nn2</name>
     <value>fphd10.ctpilot1.com:50070</value>
   </property>
   <property>
     <name>dfs.namenode.shared.edits.dir</name>
     <value>qjournal://fphd8.ctpilot1.com:8485;fphd9.ctpilot1.com:8485;fphd10.ctpilot1.com:8485/hadoopdev</value>
   </property>
   <property>
     <name>dfs.client.failover.proxy.provider.hadoopdev</name>
     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
   </property>
   <property>
     <name>dfs.ha.fencing.methods</name>
     <value>shell(/bin/true)</value>
   </property>
   <property>
     <name>dfs.journalnode.edits.dir</name>
     <value>/data/hadoop/hadoop-data-2/journal</value>
   </property>
   <property>
     <name>dfs.ha.automatic-failover.enabled</name>
     <value>true</value>
   </property>
   <property>
     <name>ha.zookeeper.quorum</name>
     <value>fphd1.ctpilot1.com:</value>
   </property>
 {code}
 I traced the code and found that when stopping the hbase master node, it will 
 try to invoke the close method on the namenode proxy, but the instance created 
 from {{org.apache.hadoop.hdfs.NameNodeProxies.createProxy}} with 
 failoverProxyProviderClass 
 {{org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider}} 
 does not implement the Closeable interface.
 If we use the Non-HA case, the created instance will be 
 {{org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB}} that 
 implement Closeable.
 TL;DR;
 With hbase connecting to hadoop HA namenode, when stopping the hbase master 
 or regionserver, it couldn't find the {{close}} method to gracefully close 
 namenode session.
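The failure above comes down to how a dynamic proxy is stopped: a stop routine can only call {{close}} if the proxy exposes Closeable. The following is a minimal, self-contained sketch of that mechanism (the class and method names here are illustrative stand-ins, not Hadoop's actual RPC classes): a proxy built over an RPC-style interface alone cannot be closed, while the same proxy built with Closeable among its interfaces can.

```java
import java.io.Closeable;
import java.lang.reflect.Proxy;

public class CloseableProxyDemo {
    // Stand-in for the RPC interface (like ClientProtocol): it does NOT extend Closeable.
    interface ClientProtocol {
        String getName();
    }

    // Stand-in for a stop routine that can only close what it sees as Closeable.
    static void stopProxy(Object proxy) throws Exception {
        if (proxy instanceof Closeable) {
            ((Closeable) proxy).close();
            return;
        }
        throw new IllegalStateException(
            "Cannot close proxy - is not Closeable or does not provide closeable invocation");
    }

    public static void main(String[] args) throws Exception {
        // HA-style proxy: exposes only ClientProtocol, so stopProxy cannot close it.
        ClientProtocol haProxy = (ClientProtocol) Proxy.newProxyInstance(
                ClientProtocol.class.getClassLoader(),
                new Class<?>[] { ClientProtocol.class },
                (p, m, a) -> "nn1");
        boolean failed = false;
        try {
            stopProxy(haProxy);
        } catch (IllegalStateException e) {
            failed = true;
        }
        System.out.println("close failed without Closeable: " + failed);  // true

        // Adding Closeable to the proxy's interfaces makes the same call succeed.
        ClientProtocol fixed = (ClientProtocol) Proxy.newProxyInstance(
                ClientProtocol.class.getClassLoader(),
                new Class<?>[] { ClientProtocol.class, Closeable.class },
                (p, m, a) -> "close".equals(m.getName()) ? null : "nn1");
        stopProxy(fixed);
        System.out.println("close succeeded with Closeable");
    }
}
```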



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5668:
--

Issue Type: Bug  (was: Task)

 TestBPOfferService.testBPInitErrorHandling fails intermittently
 ---

 Key: HDFS-5668
 URL: https://issues.apache.org/jira/browse/HDFS-5668
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor

 The new test introduced in HDFS-4201 is a little flaky. I got failures 
 locally occasionally. It could be related to how we did the mockup.
 {noformat}
 Exception in thread DataNode: 
 [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data]
   heartbeating to 0.0.0.0/0.0.0.0:0 
 org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
 SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned 
 by getStorageId()
 getStorageId() should return String
 at 
 org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723)
 2013-12-13 13:42:03,119 DEBUG datanode.DataNode 
 (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service 
 actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1
 at java.lang.Thread.run(Thread.java:722)
 {noformat}





[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5668:
--

Component/s: (was: test)
 namenode

 TestBPOfferService.testBPInitErrorHandling fails intermittently
 ---

 Key: HDFS-5668
 URL: https://issues.apache.org/jira/browse/HDFS-5668
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor

 The new test introduced in HDFS-4201 is a little flaky. I got failures 
 locally occasionally. It could be related to how we did the mockup.
 {noformat}
 Exception in thread DataNode: 
 [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data]
   heartbeating to 0.0.0.0/0.0.0.0:0 
 org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
 SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned 
 by getStorageId()
 getStorageId() should return String
 at 
 org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723)
 2013-12-13 13:42:03,119 DEBUG datanode.DataNode 
 (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service 
 actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1
 at java.lang.Thread.run(Thread.java:722)
 {noformat}





[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5668:
--

Status: Patch Available  (was: Open)

 TestBPOfferService.testBPInitErrorHandling fails intermittently
 ---

 Key: HDFS-5668
 URL: https://issues.apache.org/jira/browse/HDFS-5668
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5668.patch


 The new test introduced in HDFS-4201 is a little flaky. I got failures 
 locally occasionally. It could be related to how we did the mockup.
 {noformat}
 Exception in thread DataNode: 
 [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data]
   heartbeating to 0.0.0.0/0.0.0.0:0 
 org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
 SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned 
 by getStorageId()
 getStorageId() should return String
 at 
 org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723)
 2013-12-13 13:42:03,119 DEBUG datanode.DataNode 
 (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service 
 actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1
 at java.lang.Thread.run(Thread.java:722)
 {noformat}





[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5668:
--

Attachment: trunk-5668.patch

It turns out to be a bug: BPOfferService#toString is not synchronized, so it 
can see a partially initialized dn/bpNSInfo.
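The race described above can be sketched with a tiny stand-in class (this is an illustrative reduction, not the actual BPOfferService code): the fields are written under the object's lock during initialization, so {{toString}} must also synchronize to avoid reading partially initialized state.

```java
public class ToStringRaceSketch {
    private String bpNSInfo;  // written under the lock during registration

    synchronized void setNamespaceInfo(String info) {
        this.bpNSInfo = info;
    }

    // The fix idea: toString takes the same lock as the writers, so it
    // never observes state mid-initialization.
    @Override
    public synchronized String toString() {
        return bpNSInfo == null
            ? "Block pool <registering>"
            : "Block pool " + bpNSInfo;
    }

    public static void main(String[] args) {
        ToStringRaceSketch s = new ToStringRaceSketch();
        System.out.println(s);               // Block pool <registering>
        s.setNamespaceInfo("fake bpid");
        System.out.println(s);               // Block pool fake bpid
    }
}
```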

 TestBPOfferService.testBPInitErrorHandling fails intermittently
 ---

 Key: HDFS-5668
 URL: https://issues.apache.org/jira/browse/HDFS-5668
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5668.patch


 The new test introduced in HDFS-4201 is a little flaky. I got failures 
 locally occasionally. It could be related to how we did the mockup.
 {noformat}
 Exception in thread DataNode: 
 [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data]
   heartbeating to 0.0.0.0/0.0.0.0:0 
 org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
 SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned 
 by getStorageId()
 getStorageId() should return String
 at 
 org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723)
 2013-12-13 13:42:03,119 DEBUG datanode.DataNode 
 (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service 
 actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1
 at java.lang.Thread.run(Thread.java:722)
 {noformat}





[jira] [Updated] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5668:
--

Resolution: Duplicate
Status: Resolved  (was: Patch Available)

 TestBPOfferService.testBPInitErrorHandling fails intermittently
 ---

 Key: HDFS-5668
 URL: https://issues.apache.org/jira/browse/HDFS-5668
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5668.patch


 The new test introduced in HDFS-4201 is a little flaky. I got failures 
 locally occasionally. It could be related to how we did the mockup.
 {noformat}
 Exception in thread DataNode: 
 [file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data]
   heartbeating to 0.0.0.0/0.0.0.0:0 
 org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
 SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned 
 by getStorageId()
 getStorageId() should return String
 at 
 org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133)
 at java.lang.String.valueOf(String.java:2854)
 at java.lang.StringBuilder.append(StringBuilder.java:128)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723)
 2013-12-13 13:42:03,119 DEBUG datanode.DataNode 
 (BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service 
 actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1
 at java.lang.Thread.run(Thread.java:722)
 {noformat}





[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5666:
--

Component/s: (was: test)
 namenode

 TestBPOfferService#/testBPInitErrorHandling fails intermittently
 

 Key: HDFS-5666
 URL: https://issues.apache.org/jira/browse/HDFS-5666
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Colin Patrick McCabe
Priority: Minor

 Intermittent failure on this test:
 {code}
 Regression
 org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling
 Failing for the past 1 build (Since #5698 )
 Took 0.16 sec.
 Error Message
 expected:<1> but was:<2>
 Stacktrace
 java.lang.AssertionError: expected:<1> but was:<2>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 {code}
 see 
 https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/





[jira] [Assigned] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HDFS-5666:
-

Assignee: Jimmy Xiang

 TestBPOfferService#/testBPInitErrorHandling fails intermittently
 

 Key: HDFS-5666
 URL: https://issues.apache.org/jira/browse/HDFS-5666
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Colin Patrick McCabe
Assignee: Jimmy Xiang
Priority: Minor

 Intermittent failure on this test:
 {code}
 Regression
 org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling
 Failing for the past 1 build (Since #5698 )
 Took 0.16 sec.
 Error Message
 expected:<1> but was:<2>
 Stacktrace
 java.lang.AssertionError: expected:<1> but was:<2>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 {code}
 see 
 https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/





[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5666:
--

Attachment: trunk-5666.patch

It's a bug in BPOfferService#toString, which is not synchronized, so it can read 
partially initialized info.

 TestBPOfferService#/testBPInitErrorHandling fails intermittently
 

 Key: HDFS-5666
 URL: https://issues.apache.org/jira/browse/HDFS-5666
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Colin Patrick McCabe
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5666.patch


 Intermittent failure on this test:
 {code}
 Regression
 org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling
 Failing for the past 1 build (Since #5698 )
 Took 0.16 sec.
 Error Message
 expected:<1> but was:<2>
 Stacktrace
 java.lang.AssertionError: expected:<1> but was:<2>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 {code}
 see 
 https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/





[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5666:
--

Status: Patch Available  (was: Open)

 TestBPOfferService#/testBPInitErrorHandling fails intermittently
 

 Key: HDFS-5666
 URL: https://issues.apache.org/jira/browse/HDFS-5666
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Colin Patrick McCabe
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5666.patch


 Intermittent failure on this test:
 {code}
 Regression
 org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling
 Failing for the past 1 build (Since #5698 )
 Took 0.16 sec.
 Error Message
 expected:<1> but was:<2>
 Stacktrace
 java.lang.AssertionError: expected:<1> but was:<2>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 {code}
 see 
 https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/





[jira] [Commented] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848743#comment-13848743
 ] 

Jimmy Xiang commented on HDFS-5666:
---

With the patch, I haven't seen the test fail locally over quite a few runs.

 TestBPOfferService#/testBPInitErrorHandling fails intermittently
 

 Key: HDFS-5666
 URL: https://issues.apache.org/jira/browse/HDFS-5666
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Colin Patrick McCabe
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5666.patch


 Intermittent failure on this test:
 {code}
 Regression
 org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling
 Failing for the past 1 build (Since #5698 )
 Took 0.16 sec.
 Error Message
 expected:<1> but was:<2>
 Stacktrace
 java.lang.AssertionError: expected:<1> but was:<2>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 {code}
 see 
 https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/





[jira] [Updated] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5666:
--

Attachment: trunk-5666_v2.patch

 TestBPOfferService#/testBPInitErrorHandling fails intermittently
 

 Key: HDFS-5666
 URL: https://issues.apache.org/jira/browse/HDFS-5666
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Colin Patrick McCabe
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5666.patch, trunk-5666_v2.patch


 Intermittent failure on this test:
 {code}
 Regression
 org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling
 Failing for the past 1 build (Since #5698 )
 Took 0.16 sec.
 Error Message
 expected:<1> but was:<2>
 Stacktrace
 java.lang.AssertionError: expected:<1> but was:<2>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 {code}
 see 
 https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/





[jira] [Commented] (HDFS-5666) TestBPOfferService#/testBPInitErrorHandling fails intermittently

2013-12-15 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848788#comment-13848788
 ] 

Jimmy Xiang commented on HDFS-5666:
---

Agree. Attached patch v2.

 TestBPOfferService#/testBPInitErrorHandling fails intermittently
 

 Key: HDFS-5666
 URL: https://issues.apache.org/jira/browse/HDFS-5666
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Colin Patrick McCabe
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: trunk-5666.patch, trunk-5666_v2.patch


 Intermittent failure on this test:
 {code}
 Regression
 org.apache.hadoop.hdfs.server.datanode.TestBPOfferService.testBPInitErrorHandling
 Failing for the past 1 build (Since #5698 )
 Took 0.16 sec.
 Error Message
 expected:<1> but was:<2>
 Stacktrace
 java.lang.AssertionError: expected:<1> but was:<2>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 {code}
 see 
 https://builds.apache.org/job/PreCommit-HDFS-Build/5698//testReport/org.apache.hadoop.hdfs.server.datanode/TestBPOfferService/testBPInitErrorHandling/





[jira] [Created] (HDFS-5668) TestBPOfferService.testBPInitErrorHandling fails intermittently

2013-12-13 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HDFS-5668:
-

 Summary: TestBPOfferService.testBPInitErrorHandling fails 
intermittently
 Key: HDFS-5668
 URL: https://issues.apache.org/jira/browse/HDFS-5668
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 3.0.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor


The new test introduced in HDFS-4201 is a little flaky. I got failures locally 
occasionally. It could be related to how we did the mockup.

{noformat}
Exception in thread DataNode: 
[file:/home/.../hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testBPInitErrorHandling/data]
  heartbeating to 0.0.0.0/0.0.0.0:0 
org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
SimulatedFSDataset$$EnhancerByMockitoWithCGLIB$$5cb7c720 cannot be returned by 
getStorageId()
getStorageId() should return String
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.toString(BPOfferService.java:178)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.toString(BPServiceActor.java:133)
at java.lang.String.valueOf(String.java:2854)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:723)
2013-12-13 13:42:03,119 DEBUG datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(468)) - Sending heartbeat from service 
actor: Block pool fake bpid (storage id null) service to 0.0.0.0/0.0.0.0:1
at java.lang.Thread.run(Thread.java:722)
{noformat}





[jira] [Commented] (HDFS-5566) HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable

2013-12-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848222#comment-13848222
 ] 

Jimmy Xiang commented on HDFS-5566:
---

In NameNodeProxies#createProxy, for the HA case it creates a proxy with the 
interface ClientProtocol, which is not Closeable, plus a RetryInvocationHandler. 
However, ConfiguredFailoverProxyProvider doesn't expose the InvocationHandler 
(the field {{h}}), which is the problem.

I think this is a valid bug we need to fix.
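Since the proxy's interface list can't easily be changed, another way out is via the handler: if the InvocationHandler behind the proxy itself implements Closeable, a stop routine can fall back to closing the handler. A hedged sketch of that pattern (the class names and the stop routine here are illustrative, not Hadoop's exact API):

```java
import java.io.Closeable;
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class HandlerCloseDemo {
    interface ClientProtocol { String ping(); }

    // Handler that owns the underlying connection and knows how to close it,
    // analogous to a retrying/failover invocation handler.
    static class RetryingHandler implements InvocationHandler, Closeable {
        boolean closed = false;
        @Override public Object invoke(Object p, Method m, Object[] a) { return "pong"; }
        @Override public void close() throws IOException { closed = true; }
    }

    // Stop routine: close the proxy directly if possible, otherwise fall back
    // to closing its invocation handler.
    static void stopProxy(Object proxy) throws IOException {
        if (proxy instanceof Closeable) {
            ((Closeable) proxy).close();
            return;
        }
        InvocationHandler h = Proxy.getInvocationHandler(proxy);
        if (h instanceof Closeable) {
            ((Closeable) h).close();
            return;
        }
        throw new IllegalStateException("Cannot close proxy");
    }

    public static void main(String[] args) throws IOException {
        RetryingHandler handler = new RetryingHandler();
        ClientProtocol proxy = (ClientProtocol) Proxy.newProxyInstance(
                ClientProtocol.class.getClassLoader(),
                new Class<?>[] { ClientProtocol.class }, handler);
        stopProxy(proxy);  // proxy isn't Closeable, but its handler is
        System.out.println("handler closed: " + handler.closed);  // true
    }
}
```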

 HA namenode with QJM created from 
 org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider 
 should implement Closeable
 --

 Key: HDFS-5566
 URL: https://issues.apache.org/jira/browse/HDFS-5566
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: hadoop-2.2.0
 hbase-0.96
Reporter: Henry Hung

 When using hbase-0.96 with hadoop-2.2.0, stopping master/regionserver node 
 will result in {{Cannot close proxy - is not Closeable or does not provide 
 closeable invocation}}.
 [Mail 
 Archive|https://drive.google.com/file/d/0B22pkxoqCdvWSGFIaEpfR3lnT2M/edit?usp=sharing]
 My hadoop-2.2.0 configured as HA namenode with QJM, the configuration is like 
 this:
 {code:xml}
   <property>
     <name>dfs.nameservices</name>
     <value>hadoopdev</value>
   </property>
   <property>
     <name>dfs.ha.namenodes.hadoopdev</name>
     <value>nn1,nn2</value>
   </property>
   <property>
     <name>dfs.namenode.rpc-address.hadoopdev.nn1</name>
     <value>fphd9.ctpilot1.com:9000</value>
   </property>
   <property>
     <name>dfs.namenode.http-address.hadoopdev.nn1</name>
     <value>fphd9.ctpilot1.com:50070</value>
   </property>
   <property>
     <name>dfs.namenode.rpc-address.hadoopdev.nn2</name>
     <value>fphd10.ctpilot1.com:9000</value>
   </property>
   <property>
     <name>dfs.namenode.http-address.hadoopdev.nn2</name>
     <value>fphd10.ctpilot1.com:50070</value>
   </property>
   <property>
     <name>dfs.namenode.shared.edits.dir</name>
     <value>qjournal://fphd8.ctpilot1.com:8485;fphd9.ctpilot1.com:8485;fphd10.ctpilot1.com:8485/hadoopdev</value>
   </property>
   <property>
     <name>dfs.client.failover.proxy.provider.hadoopdev</name>
     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
   </property>
   <property>
     <name>dfs.ha.fencing.methods</name>
     <value>shell(/bin/true)</value>
   </property>
   <property>
     <name>dfs.journalnode.edits.dir</name>
     <value>/data/hadoop/hadoop-data-2/journal</value>
   </property>
   <property>
     <name>dfs.ha.automatic-failover.enabled</name>
     <value>true</value>
   </property>
   <property>
     <name>ha.zookeeper.quorum</name>
     <value>fphd1.ctpilot1.com:</value>
   </property>
 {code}
  I traced the code and found that when stopping the hbase master node, it 
  tries to invoke the close method on the namenode proxy, but the instance 
  created by {{org.apache.hadoop.hdfs.NameNodeProxies.createProxy}} with 
  failoverProxyProviderClass 
  {{org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider}} 
  does not implement the Closeable interface.
  In the non-HA case, the created instance is 
  {{org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB}}, which 
  does implement Closeable.
  TL;DR:
  With hbase connecting to a hadoop HA namenode, stopping the hbase master 
  or regionserver cannot find a {{close}} method to gracefully close the 
  namenode session.





[jira] [Reopened] (HDFS-5566) HA namenode with QJM created from org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider should implement Closeable

2013-12-13 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reopened HDFS-5566:
---

  Assignee: Jimmy Xiang

 HA namenode with QJM created from 
 org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider 
 should implement Closeable
 --

 Key: HDFS-5566
 URL: https://issues.apache.org/jira/browse/HDFS-5566
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: hadoop-2.2.0
 hbase-0.96
Reporter: Henry Hung
Assignee: Jimmy Xiang

 When using hbase-0.96 with hadoop-2.2.0, stopping master/regionserver node 
 will result in {{Cannot close proxy - is not Closeable or does not provide 
 closeable invocation}}.
 [Mail 
 Archive|https://drive.google.com/file/d/0B22pkxoqCdvWSGFIaEpfR3lnT2M/edit?usp=sharing]
 My hadoop-2.2.0 configured as HA namenode with QJM, the configuration is like 
 this:
 {code:xml}
   <property>
     <name>dfs.nameservices</name>
     <value>hadoopdev</value>
   </property>
   <property>
     <name>dfs.ha.namenodes.hadoopdev</name>
     <value>nn1,nn2</value>
   </property>
   <property>
     <name>dfs.namenode.rpc-address.hadoopdev.nn1</name>
     <value>fphd9.ctpilot1.com:9000</value>
   </property>
   <property>
     <name>dfs.namenode.http-address.hadoopdev.nn1</name>
     <value>fphd9.ctpilot1.com:50070</value>
   </property>
   <property>
     <name>dfs.namenode.rpc-address.hadoopdev.nn2</name>
     <value>fphd10.ctpilot1.com:9000</value>
   </property>
   <property>
     <name>dfs.namenode.http-address.hadoopdev.nn2</name>
     <value>fphd10.ctpilot1.com:50070</value>
   </property>
   <property>
     <name>dfs.namenode.shared.edits.dir</name>
     <value>qjournal://fphd8.ctpilot1.com:8485;fphd9.ctpilot1.com:8485;fphd10.ctpilot1.com:8485/hadoopdev</value>
   </property>
   <property>
     <name>dfs.client.failover.proxy.provider.hadoopdev</name>
     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
   </property>
   <property>
     <name>dfs.ha.fencing.methods</name>
     <value>shell(/bin/true)</value>
   </property>
   <property>
     <name>dfs.journalnode.edits.dir</name>
     <value>/data/hadoop/hadoop-data-2/journal</value>
   </property>
   <property>
     <name>dfs.ha.automatic-failover.enabled</name>
     <value>true</value>
   </property>
   <property>
     <name>ha.zookeeper.quorum</name>
     <value>fphd1.ctpilot1.com:</value>
   </property>
 {code}
  I traced the code and found that when stopping the hbase master node, it 
  tries to invoke the close method on the namenode proxy, but the instance 
  created by {{org.apache.hadoop.hdfs.NameNodeProxies.createProxy}} with 
  failoverProxyProviderClass 
  {{org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider}} 
  does not implement the Closeable interface.
  In the non-HA case, the created instance is 
  {{org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB}}, which 
  does implement Closeable.
  TL;DR:
  With hbase connecting to a hadoop HA namenode, stopping the hbase master 
  or regionserver cannot find a {{close}} method to gracefully close the 
  namenode session.





[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-11 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Priority: Minor  (was: Major)

 Name Node should report fsimage transfer time as a metric
 -

 Key: HDFS-5350
 URL: https://issues.apache.org/jira/browse/HDFS-5350
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Rob Weltman
Assignee: Jimmy Xiang
Priority: Minor

 If the (Secondary) Name Node reported fsimage transfer times (perhaps the 
 last ten of them), monitoring tools could detect slowdowns that might 
 jeopardize cluster stability.





[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-11 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Attachment: trunk-5350.patch

Attached a patch that adds metrics for fsimage download/upload.

 Name Node should report fsimage transfer time as a metric
 -

 Key: HDFS-5350
 URL: https://issues.apache.org/jira/browse/HDFS-5350
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Rob Weltman
Assignee: Jimmy Xiang
Priority: Minor
 Fix For: 3.0.0

 Attachments: trunk-5350.patch


 If the (Secondary) Name Node reported fsimage transfer times (perhaps the 
 last ten of them), monitoring tools could detect slowdowns that might 
 jeopardize cluster stability.





[jira] [Updated] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-11 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-5350:
--

Fix Version/s: 3.0.0
   Status: Patch Available  (was: Open)

 Name Node should report fsimage transfer time as a metric
 -

 Key: HDFS-5350
 URL: https://issues.apache.org/jira/browse/HDFS-5350
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Rob Weltman
Assignee: Jimmy Xiang
Priority: Minor
 Fix For: 3.0.0

 Attachments: trunk-5350.patch


 If the (Secondary) Name Node reported fsimage transfer times (perhaps the 
 last ten of them), monitoring tools could detect slowdowns that might 
 jeopardize cluster stability.





[jira] [Commented] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-11 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845805#comment-13845805
 ] 

Jimmy Xiang commented on HDFS-5350:
---

I tested the patch on my cluster. Here is the new metrics from the jmx page:
{noformat}
GetImageNumOps : 56,
GetImageAvgTime : 3.75,
PutImageNumOps : 51,
PutImageAvgTime : 80.0
{noformat}

 Name Node should report fsimage transfer time as a metric
 -

 Key: HDFS-5350
 URL: https://issues.apache.org/jira/browse/HDFS-5350
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Rob Weltman
Assignee: Jimmy Xiang
Priority: Minor
 Fix For: 3.0.0

 Attachments: trunk-5350.patch


 If the (Secondary) Name Node reported fsimage transfer times (perhaps the 
 last ten of them), monitoring tools could detect slowdowns that might 
 jeopardize cluster stability.





[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-10 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4201:
--

Status: Patch Available  (was: Open)

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch, 
 trunk-4201_v3.patch


 Saw the following NPE in a log.
 Think this is likely due to {{dn}} or {{dn.getFSDataset()}} being null, (not 
 {{bpRegistration}}) due to a configuration or local directory failure.
 {code}
 2012-09-25 04:33:20,782 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
 For namenode svsrs00127/11.164.162.226:8020 using DELETEREPORT_INTERVAL of 
 30 msec  BLOCKREPORT_INTERVAL of 2160msec Initial delay: 0msec; 
 heartBeatInterval=3000
 2012-09-25 04:33:20,782 ERROR 
 org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService 
 for Block pool BP-1678908700-11.164.162.226-1342785481826 (storage id 
 DS-1031100678-11.164.162.251-5010-1341933415989) service to 
 svsrs00127/11.164.162.226:8020
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:434)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:520)
 at 
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673)
 at java.lang.Thread.run(Thread.java:722)
 {code}





[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-10 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4201:
--

Status: Open  (was: Patch Available)

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch, 
 trunk-4201_v3.patch







[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-10 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1381#comment-1381
 ] 

Jimmy Xiang commented on HDFS-4201:
---

The javadoc warnings are not related to the patch: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5685/artifact/trunk/patchprocess/patchJavadocWarnings.txt

The audit warning is due to a memory issue on the build machine:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 172104 bytes for 
Arena::Amalloc
# An error report file with more information is saved as:
# 
/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/trunk/hs_err_pid24616.log

Trying the hadoop-qa again.

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch, 
 trunk-4201_v3.patch







[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-10 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844678#comment-13844678
 ] 

Jimmy Xiang commented on HDFS-4201:
---

The test failure is not related to the patch: Problem binding to 
[0.0.0.0:50010] (java.net.BindException). It works fine locally and in the 
previous build.

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch, 
 trunk-4201_v3.patch







[jira] [Assigned] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-10 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HDFS-5350:
-

Assignee: Jimmy Xiang  (was: Andrew Wang)

 Name Node should report fsimage transfer time as a metric
 -

 Key: HDFS-5350
 URL: https://issues.apache.org/jira/browse/HDFS-5350
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Rob Weltman
Assignee: Jimmy Xiang






[jira] [Commented] (HDFS-5350) Name Node should report fsimage transfer time as a metric

2013-12-10 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844884#comment-13844884
 ] 

Jimmy Xiang commented on HDFS-5350:
---

Instead of sliding-window metrics, I will add two MutableRate metrics for 
fsimage upload and download latency. From this information, we can also tell 
whether the fsimage transfer is normal.
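
The MutableRate semantics referred to here, an operation count plus a mean latency, can be sketched with a minimal, Hadoop-free class. This is an illustrative sketch only; {{SimpleRate}}, {{add}}, {{numOps}}, and {{avgTime}} are hypothetical names, not the actual org.apache.hadoop.metrics2.lib.MutableRate API.

```java
// Minimal sketch of MutableRate-style semantics: count operations and
// track their mean latency. Illustrative only; the real Hadoop
// MutableRate class has a different API and snapshot behavior.
public class SimpleRate {
    private long numOps;
    private double totalTimeMillis;

    // Record one operation that took 'elapsedMillis' milliseconds.
    public synchronized void add(long elapsedMillis) {
        numOps++;
        totalTimeMillis += elapsedMillis;
    }

    public synchronized long numOps() {
        return numOps;
    }

    // Mean latency over all recorded operations (0 if none recorded).
    public synchronized double avgTime() {
        return numOps == 0 ? 0.0 : totalTimeMillis / numOps;
    }
}
```

With such a metric, a NumOps/AvgTime pair per transfer direction is enough to spot abnormally slow fsimage transfers without keeping a sliding window of samples.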

 Name Node should report fsimage transfer time as a metric
 -

 Key: HDFS-5350
 URL: https://issues.apache.org/jira/browse/HDFS-5350
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Rob Weltman
Assignee: Jimmy Xiang






[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-09 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843458#comment-13843458
 ] 

Jimmy Xiang commented on HDFS-4201:
---

That's another solution I considered. With try+finally, we would need to catch 
all known and unknown exceptions thrown by initBlockPool and then re-throw 
them, which would not look very clean. Is there any known initialization path 
change coming soon?
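
The try+finally alternative being weighed here could look roughly like the following self-contained sketch; {{InitSketch}}, {{initBlockPool}}, and {{shutdownBlockPool}} are hypothetical stand-ins, not the real DataNode code.

```java
// Sketch of the catch-all-and-rethrow shape: any failure from
// initBlockPool must be intercepted, cleanup run, and the original
// error re-thrown. Names are illustrative, not actual DataNode code.
public class InitSketch {
    private boolean cleanedUp;

    void initBlockPool() {
        // Simulate a failed initialization (e.g. a bad storage directory).
        throw new IllegalStateException("bad storage dir");
    }

    void shutdownBlockPool() {
        cleanedUp = true;  // undo partial initialization
    }

    public void start() {
        try {
            initBlockPool();
        } catch (Throwable t) {   // must catch known *and* unknown errors
            shutdownBlockPool();
            throw t;              // re-throw so the caller still sees it
        }
    }

    public boolean cleanedUp() {
        return cleanedUp;
    }
}
```

The shape works, but every failure path has to funnel through the catch block and be re-thrown, which is the awkwardness the comment alludes to.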

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch







[jira] [Commented] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-09 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843622#comment-13843622
 ] 

Jimmy Xiang commented on HDFS-4201:
---

Sure, I will do as suggested so that we can minimize the changes. Thanks.

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch







[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-09 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4201:
--

Attachment: trunk-4201_v3.patch

Attached v3, which isolates the changes as suggested.

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch, 
 trunk-4201_v3.patch







[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-06 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4201:
--

Status: Open  (was: Patch Available)

Looking into the test failures.

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch







[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-06 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4201:
--

Status: Patch Available  (was: Open)

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch







[jira] [Updated] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-06 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HDFS-4201:
--

Attachment: trunk-4201_v2.patch

Fixed the test failures. Also enhanced the fix a little so that the block pool 
is registered only after datanode initialization is done.
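
The ordering described in the v2 fix, completing datanode initialization before registering the block pool, can be illustrated with a hedged sketch; {{DatanodeSketch}}, {{fsDataset}}, and {{safeToHeartbeat}} are hypothetical names, not the actual BPServiceActor/DataNode fields.

```java
// Illustrative sketch of the fix's ordering: the block pool is only
// registered once initialization has completed, so a heartbeat sender
// never observes a half-initialized datanode. Names are hypothetical.
public class DatanodeSketch {
    private volatile Object fsDataset;   // null until initialization finishes
    private volatile boolean bpRegistered;

    public void initBlockPool() {
        fsDataset = new Object();        // heavyweight initialization elided
        // Register only after every field the actor thread reads is set.
        bpRegistered = true;
    }

    // The invariant BPServiceActor#sendHeartBeat must be able to rely on:
    // if the block pool is registered, the dataset reference is non-null.
    public boolean safeToHeartbeat() {
        return !bpRegistered || fsDataset != null;
    }
}
```

Registering last means a heartbeat thread that only acts on registered block pools can never observe a null dataset, which is the NullPointerException this issue reports.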

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical
 Fix For: 3.0.0

 Attachments: trunk-4201.patch, trunk-4201_v2.patch







[jira] [Assigned] (HDFS-4201) NPE in BPServiceActor#sendHeartBeat

2013-12-05 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang reassigned HDFS-4201:
-

Assignee: Jimmy Xiang

 NPE in BPServiceActor#sendHeartBeat
 ---

 Key: HDFS-4201
 URL: https://issues.apache.org/jira/browse/HDFS-4201
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Eli Collins
Assignee: Jimmy Xiang
Priority: Critical





