[jira] [Commented] (HDFS-1490) TransferFSImage should timeout

2012-08-28 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442992#comment-13442992
 ] 

Ravi Prakash commented on HDFS-1490:


+1 lgtm

 TransferFSImage should timeout
 --

 Key: HDFS-1490
 URL: https://issues.apache.org/jira/browse/HDFS-1490
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Dmytro Molkov
Assignee: Dmytro Molkov
Priority: Minor
 Attachments: HDFS-1490.patch, HDFS-1490.patch


 Sometimes when the primary crashes during image transfer, the secondary namenode 
 hangs forever trying to read the image from the HTTP connection.
 It would be great to set timeouts on the connection so that if something like that 
 happens there is no need to restart the secondary itself.
 In our case restarting components is handled by a set of scripts, and since 
 the Secondary process is still running, it would just stay hung until we get 
 an alarm saying that checkpointing isn't happening.
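A minimal sketch of the kind of fix being requested: set explicit connect and read timeouts on the image-transfer connection so that a dead peer raises an exception instead of hanging forever. This is an illustration only, not the attached patch; the class name, URL, and timeout value are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class TransferTimeoutSketch {
    /** Open a connection with both timeouts set, so a crashed peer causes a
     *  SocketTimeoutException instead of an indefinite hang on read(). */
    public static HttpURLConnection openWithTimeouts(URL url, int timeoutMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(timeoutMs); // fail fast if the peer is unreachable
        conn.setReadTimeout(timeoutMs);    // fail if the peer stops sending data
        return conn;
    }

    /** Inspection helper; openConnection() performs no network I/O yet. */
    public static int configuredReadTimeout(String url, int timeoutMs) {
        try {
            return openWithTimeouts(new URL(url), timeoutMs).getReadTimeout();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```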

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3540) Further improvement on recovery mode and edit log toleration in branch-1

2012-08-28 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443002#comment-13443002
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3540:
--

If the edit log is not corrupted, neither recovery mode nor edit log toleration 
is useful.  Note that recovery mode here means the recovery mode in branch-1, 
not the one in trunk.

When an edit log is corrupted, NN cannot start up normally.  We compare 
recovery mode and edit log toleration below.

*Recovery Mode*
- Recovery here means starting NN with a corrupted edit log.  It is unable to 
recover the corrupted edit log or transactions.
- There is a namenode command option, {{hadoop namenode -recover}}, to enter 
recovery mode.
- When reading the first corrupted transaction in the edit log, it prompts the 
admin to either stop reading or quit without saving.
- If stop reading is selected, NN ignores the remaining edit log (from the 
first corrupted transaction to the end of the edit log) and then starts up as 
usual.
- There is a -force option (FORCE_FIRST_CHOICE) that automatically takes the 
first choice, i.e. a non-interactive mode.
- If there is a stray OP_INVALID byte, it could be misinterpreted as an 
end-of-log and lead to silent data loss.  Recovery Mode does not help.
(Please help out if I have missed anything.)

*Edit Log Toleration*
- It has a conf property dfs.namenode.edits.toleration.length for setting the 
toleration length.
- The default toleration length is -1, i.e. the feature is disabled.  The 
feature is enabled when the value is >= 0.
- When the feature is enabled, it always reads the entire edit log, computes 
read length, corruption length and padding length and shows the following 
summary
{noformat}
2012-08-27 22:04:38,625 INFO  - Checked the bytes after the end of edit log
  (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits):
2012-08-27 22:04:38,625 INFO  -   Padding position  = 876 (-1 means padding not 
found)
2012-08-27 22:04:38,625 INFO  -   Edit log length   = 1065
2012-08-27 22:04:38,625 INFO  -   Read length   = 168
2012-08-27 22:04:38,625 INFO  -   Corruption length = 708
2012-08-27 22:04:38,625 INFO  -   Toleration length = 1024 (= 
dfs.namenode.edits.toleration.length)
2012-08-27 22:04:38,626 INFO  - Summary: |-- Read=168 --|-- 
Corrupt=708 --|-- Pad=189 --|
2012-08-27 22:04:38,626 WARN  - Edit log corruption detected:
  corruption length = 708 <= toleration length = 1024; the corruption is 
tolerable.
{noformat}
- When the toleration length is set to >= 0, the check covers the entire log, 
including the padding, so that a stray OP_INVALID byte won't be 
misinterpreted as an end-of-log.
- When the toleration length is set to >= 0, NN starts up only if corruption 
length <= toleration length.  If corruption length > toleration length, it 
throws an exception as below
{noformat}
2012-08-27 22:04:39,123 INFO  - Start checking end of edit log 
(/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits) ...
2012-08-27 22:04:39,123 DEBUG - found: bytes[0]=0xFF=pad, firstPadPos=169
2012-08-27 22:04:39,123 DEBUG - reset: bytes[1410]=0xAB, pad=0xFF
2012-08-27 22:04:39,124 DEBUG - found: bytes[1411]=0xFF=pad, firstPadPos=1580
2012-08-27 22:04:39,124 INFO  - Checked the bytes after the end of edit log 
(/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits):
2012-08-27 22:04:39,124 INFO  -   Padding position  = 1580 (-1 means padding 
not found)
2012-08-27 22:04:39,124 INFO  -   Edit log length   = 2638
2012-08-27 22:04:39,124 INFO  -   Read length   = 169
2012-08-27 22:04:39,124 INFO  -   Corruption length = 1411
2012-08-27 22:04:39,124 INFO  -   Toleration length = 1024 (= 
dfs.namenode.edits.toleration.length)
2012-08-27 22:04:39,125 INFO  - Summary: |-- Read=169 --|-- 
Corrupt=1411 --|-- Pad=1058 --|
2012-08-27 22:04:39,125 ERROR - FSNamesystem initialization failed.
java.io.IOException: Edit log corruption detected:
  corruption length = 1411 > toleration length = 1024; the corruption is 
intolerable.
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkEndOfLog(FSEditLog.java:609)
...
{noformat}
- Therefore, the recommended setting is to set the conf to 0 (or a small 
number).  When corruption is detected (i.e. NN cannot start up), the corruption 
length can be read from the log.  The admin can then decide whether they are 
willing to tolerate the corruption, or they can try to recover the edit log by 
other means.
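The toleration decision above reduces to simple arithmetic over the |-- Read --|-- Corrupt --|-- Pad --| layout: NN starts up iff the corruption length fits within the configured toleration. A sketch of that decision (the class and method names are illustrative, not the branch-1 API):

```java
public class EditLogTolerationCheck {
    /** Startup is allowed iff the corruption fits within the configured
     *  toleration.  A toleration length of -1 disables the check entirely,
     *  i.e. the old behavior applies. */
    public static boolean isTolerable(long corruptionLength, long tolerationLength) {
        if (tolerationLength < 0) {
            return true; // feature disabled (default: -1)
        }
        return corruptionLength <= tolerationLength;
    }
}
```

Plugging in the two log excerpts above: 708 <= 1024 is tolerable, while 1411 > 1024 is not.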

 Further improvement on recovery mode and edit log toleration in branch-1
 

 Key: HDFS-3540
 URL: https://issues.apache.org/jira/browse/HDFS-3540
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 1.2.0
Reporter: Tsz Wo (Nicholas), SZE
Assignee: 

[jira] [Updated] (HDFS-1490) TransferFSImage should timeout

2012-08-28 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1490:
--

Status: Patch Available  (was: Open)



[jira] [Commented] (HDFS-1490) TransferFSImage should timeout

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443034#comment-13443034
 ] 

Hadoop QA commented on HDFS-1490:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542723/HDFS-1490.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3105//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3105//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3105//console

This message is automatically generated.



[jira] [Commented] (HDFS-1490) TransferFSImage should timeout

2012-08-28 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443051#comment-13443051
 ] 

Vinay commented on HDFS-1490:
-

{code}Call to equals() comparing different types in 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock){code}
The findbugs warning is unrelated to the current patch.

{code}Failed tests:   
testHdfsDelegationToken(org.apache.hadoop.hdfs.TestHftpDelegationToken): wrong 
tokens in user expected:<2> but was:<1>{code}
The test failure is also unrelated to the current patch.



[jira] [Assigned] (HDFS-3847) using NFS As a shared storage for NameNode HA , how to ensure that only one write

2012-08-28 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K reassigned HDFS-3847:
---

Assignee: (was: Devaraj K)

 using NFS As a shared storage for NameNode HA , how to ensure that only one 
 write
 -

 Key: HDFS-3847
 URL: https://issues.apache.org/jira/browse/HDFS-3847
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.0.0-alpha, 2.0.1-alpha
Reporter: liaowenrui
Priority: Critical
 Fix For: 2.0.0-alpha






[jira] [Assigned] (HDFS-3847) using NFS As a shared storage for NameNode HA , how to ensure that only one write

2012-08-28 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K reassigned HDFS-3847:
---

Assignee: Devaraj K



[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443056#comment-13443056
 ] 

Suresh Srinivas commented on HDFS-3860:
---

Jing, nice find. Submitting the patch.

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the 
 monitor thread acquires the write lock of the namesystem and rechecks the 
 safemode. If it is in safemode, the monitor thread returns from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.
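The leak described above is the classic lock()/early-return bug; the conventional fix is to release the lock in a finally block. A minimal, self-contained illustration with hypothetical names (not the actual FSNamesystem/HeartbeatManager code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HeartbeatCheckSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final boolean inSafeMode;

    public HeartbeatCheckSketch(boolean inSafeMode) {
        this.inSafeMode = inSafeMode;
    }

    /** Rechecks safemode under the write lock; the finally block guarantees
     *  the lock is released even on the early safemode return. */
    public void heartbeatCheck() {
        lock.writeLock().lock();
        try {
            if (inSafeMode) {
                return; // the early return no longer leaks the write lock
            }
            // ... remove the dead datanode here ...
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** True iff the current thread still holds the write lock. */
    public boolean writeLockHeld() {
        return lock.writeLock().isHeldByCurrentThread();
    }
}
```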



[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443058#comment-13443058
 ] 

Suresh Srinivas commented on HDFS-3860:
---

BTW, could you please also make sure that this code pattern is not repeated 
anywhere else.



[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated HDFS-3860:
--

Status: Patch Available  (was: Open)



[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes

2012-08-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443075#comment-13443075
 ] 

Suresh Srinivas commented on HDFS-3791:
---

Uma sorry for the delay in reviewing this. +1 for the patch.

 Backport HDFS-173 to Branch-1 :  Recursively deleting a directory with 
 millions of files makes NameNode unresponsive for other commands until the 
 deletion completes
 

 Key: HDFS-3791
 URL: https://issues.apache.org/jira/browse/HDFS-3791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 1.0.0
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
 Attachments: HDFS-3791.patch, HDFS-3791.patch


 Backport HDFS-173. 
 see the 
 [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007]
  for more details



[jira] [Updated] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes

2012-08-28 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated HDFS-3791:
--

Attachment: HDFS-3791.patch

Rebased the patch on latest branch-1



[jira] [Resolved] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes

2012-08-28 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas resolved HDFS-3791.
---

   Resolution: Fixed
Fix Version/s: 1.2.0
 Hadoop Flags: Reviewed

I committed the patch. Thank you Uma.



[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes

2012-08-28 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443103#comment-13443103
 ] 

Uma Maheswara Rao G commented on HDFS-3791:
---

Oh, I have just seen the comments.
{quote}
Uma sorry for the delay in reviewing this. +1 for the patch.
{quote}
No problem :-). Thanks a lot, Suresh for the reviews.
Also thanks for rebasing it. I will try to get a patch for HDFS-2815 out 
sometime today.



[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443105#comment-13443105
 ] 

Hadoop QA commented on HDFS-3860:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542695/HDFS-3860.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3106//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3106//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3106//console

This message is automatically generated.



[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning

2012-08-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443113#comment-13443113
 ] 

Suresh Srinivas commented on HDFS-3837:
---

It seems to me the findbugs warning is not fixed by the new patch, or is it a 
Jenkins error?

Fixing this issue quickly will help. Currently all Jenkins precommit reports 
have a findbugs -1.

{noformat}
Call to equals() comparing different types in 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
Bug type EC_UNRELATED_TYPES (click for details) 
In class org.apache.hadoop.hdfs.server.datanode.DataNode
In method 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
Actual type org.apache.hadoop.hdfs.protocol.DatanodeInfo
Expected org.apache.hadoop.hdfs.server.protocol.DatanodeRegistration
Value loaded from id
Value loaded from bpReg
org.apache.hadoop.hdfs.server.protocol.DatanodeRegistration.equals(Object) used 
to determine equality
At DataNode.java:[line 1869]
{noformat}

 Fix DataNode.recoverBlock findbugs warning
 --

 Key: HDFS-3837
 URL: https://issues.apache.org/jira/browse/HDFS-3837
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Eli Collins
 Attachments: hdfs-3837.txt, hdfs-3837.txt


 HDFS-2686 introduced the following findbugs warning:
 {noformat}
 Call to equals() comparing different types in 
 org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
 {noformat}
 Both sides use DatanodeID#equals, but findbugs sees different methods because 
 DNR#equals overrides equals for some reason (without changing behavior).
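For context, EC_UNRELATED_TYPES fires because a well-behaved equals() can never return true across unrelated classes, so such a comparison is either a bug or dead code. A generic illustration with made-up classes, unrelated to the actual Hadoop types:

```java
public class UnrelatedEqualsSketch {
    public static class DatanodeInfoLike {
        final String id;
        public DatanodeInfoLike(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            // A well-behaved equals rejects instances of unrelated types...
            return o instanceof DatanodeInfoLike
                && id.equals(((DatanodeInfoLike) o).id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    public static class RegistrationLike {
        final String id;
        public RegistrationLike(String id) { this.id = id; }
    }

    public static boolean compare(DatanodeInfoLike a, RegistrationLike b) {
        // ...so this cross-type comparison is always false, which is
        // exactly what findbugs' EC_UNRELATED_TYPES flags.
        return a.equals(b);
    }
}
```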



[jira] [Commented] (HDFS-3856) TestHDFSServerPorts failure is causing surefire fork failure

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443123#comment-13443123
 ] 

Hudson commented on HDFS-3856:
--

Integrated in Hadoop-Hdfs-trunk #1148 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1148/])
Fixup CHANGELOG for HDFS-3856. (Revision 1377936)
HDFS-3856. TestHDFSServerPorts failure is causing surefire fork failure. 
Contributed by Colin Patrick McCabe (Revision 1377934)

 Result = FAILURE
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377936
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377934
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java


 TestHDFSServerPorts failure is causing surefire fork failure
 

 Key: HDFS-3856
 URL: https://issues.apache.org/jira/browse/HDFS-3856
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 2.2.0-alpha
Reporter: Thomas Graves
Assignee: Eli Collins
Priority: Blocker
 Fix For: 2.2.0-alpha

 Attachments: hdfs-3856.txt, hdfs-3856.txt


 We have been seeing the hdfs tests on trunk and branch-2 error out with fork 
 failures.  I see the hadoop jenkins trunk build is also seeing these:
 https://builds.apache.org/view/Hadoop/job/Hadoop-trunk/lastCompletedBuild/console



[jira] [Commented] (HDFS-3856) TestHDFSServerPorts failure is causing surefire fork failure

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443151#comment-13443151
 ] 

Hudson commented on HDFS-3856:
--

Integrated in Hadoop-Mapreduce-trunk #1179 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1179/])
Fixup CHANGELOG for HDFS-3856. (Revision 1377936)
HDFS-3856. TestHDFSServerPorts failure is causing surefire fork failure. 
Contributed by Colin Patrick McCabe (Revision 1377934)

 Result = FAILURE
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377936
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1377934
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java




[jira] [Updated] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225

2012-08-28 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated HDFS-3852:
--

Attachment: HDFS-3852.patch

The test is attempting to insert two tokens with the same service.  The UGI's 
private creds is a list, which happily accepted tokens with duplicate services 
and even duplicate tokens.  When I changed UGI in HADOOP-8225 to allow 
extraction of a {{Credentials}} object from the UGI, it broke the test because 
{{Credentials}} uses a map for tokens, which naturally doesn't allow service 
dups.  The test is really trying to ensure the correct token is retrieved for 
hftp, so I changed the 2nd token to have a different service to prevent it 
from replacing the first token.

Arguably, multiple tokens for the same service with different kinds should be 
permissible.  However in practice that is/was not possible because a 
{{Credentials}} (which doesn't allow service dups) is used to build up tokens 
to be dumped into the UGI.
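The list-versus-map behavior described above can be sketched with a toy stand-in (the class and field names below are hypothetical, not the actual UGI or {{Credentials}} API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenDupSketch {
    // Hypothetical stand-in for a delegation token: only the service
    // and kind fields matter for this illustration.
    static final class Token {
        final String service;
        final String kind;
        Token(String service, String kind) { this.service = service; this.kind = kind; }
    }

    // A map keyed by service, like the Credentials token map: a second
    // token with the same service silently replaces the first.
    static int distinctServices(List<Token> tokens) {
        Map<String, Token> byService = new HashMap<>();
        for (Token t : tokens) {
            byService.put(t.service, t);
        }
        return byService.size();
    }

    public static void main(String[] args) {
        // The UGI's private creds is a plain list, so it keeps both tokens.
        List<Token> creds = new ArrayList<>();
        creds.add(new Token("host:8020", "HDFS_DELEGATION_TOKEN"));
        creds.add(new Token("host:8020", "HFTP_DELEGATION_TOKEN"));
        System.out.println(creds.size());             // 2: the list keeps the dup
        System.out.println(distinctServices(creds));  // 1: the map drops it
    }
}
```

With two tokens sharing one service, the list keeps both while the map keeps only the last one put, which is exactly why the test's second token replaced the first.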

 TestHftpDelegationToken is broken after HADOOP-8225
 ---

 Key: HDFS-3852
 URL: https://issues.apache.org/jira/browse/HDFS-3852
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, security
Affects Versions: 0.23.3, 2.1.0-alpha
Reporter: Aaron T. Myers
Assignee: Daryn Sharp
 Attachments: HDFS-3852.patch


 It's been failing in all builds for the last 2 days or so. Git bisect 
 indicates that it's due to HADOOP-8225.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225

2012-08-28 Thread Daryn Sharp (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated HDFS-3852:
--

Status: Patch Available  (was: Open)

 TestHftpDelegationToken is broken after HADOOP-8225
 ---

 Key: HDFS-3852
 URL: https://issues.apache.org/jira/browse/HDFS-3852
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, security
Affects Versions: 0.23.3, 2.1.0-alpha
Reporter: Aaron T. Myers
Assignee: Daryn Sharp
 Attachments: HDFS-3852.patch


 It's been failing in all builds for the last 2 days or so. Git bisect 
 indicates that it's due to HADOOP-8225.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225

2012-08-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443212#comment-13443212
 ] 

Aaron T. Myers commented on HDFS-3852:
--

Got it. Makes sense. Thanks for the explanation, Daryn, and thanks for looking 
into this issue.

The patch looks good to me. +1 pending Jenkins.

 TestHftpDelegationToken is broken after HADOOP-8225
 ---

 Key: HDFS-3852
 URL: https://issues.apache.org/jira/browse/HDFS-3852
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, security
Affects Versions: 0.23.3, 2.1.0-alpha
Reporter: Aaron T. Myers
Assignee: Daryn Sharp
 Attachments: HDFS-3852.patch


 It's been failing in all builds for the last 2 days or so. Git bisect 
 indicates that it's due to HADOOP-8225.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0

2012-08-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443221#comment-13443221
 ] 

Robert Joseph Evans commented on HDFS-3731:
---

Any update on branch-0.23?  Do you want me to look into it?

 2.0 release upgrade must handle blocks being written from 1.0
 -

 Key: HDFS-3731
 URL: https://issues.apache.org/jira/browse/HDFS-3731
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Suresh Srinivas
Assignee: Colin Patrick McCabe
Priority: Blocker
 Fix For: 2.2.0-alpha

 Attachments: hadoop1-bbw.tgz, HDFS-3731.002.patch, HDFS-3731.003.patch


 Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 
 release. Problem reported by Brahma Reddy.
 The {{DataNode}} will only have one block pool after upgrading from a 1.x 
 release.  (This is because in the 1.x releases, there were no block pools-- 
 or equivalently, everything was in the same block pool).  During the upgrade, 
 we should hardlink the block files from the {{blocksBeingWritten}} directory 
 into the {{rbw}} directory of this block pool.  Similarly, on {{-finalize}}, 
 we should delete the {{blocksBeingWritten}} directory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning

2012-08-28 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-3837:
--

Attachment: hdfs-3837.txt

The findbugs warning seems bogus: "This method calls equals(Object) on two 
references of different class types with no common subclasses. Therefore, the 
objects being compared are unlikely to be members of the same class at 
runtime."  Both DatanodeInfo and DatanodeRegistration extend DatanodeID, so 
they both share the equals implementation.

Anyway, I'll put the relevant code back (cast the array) since this fixes the 
findbugs warning and is fine (just more verbose).

{code}
-DatanodeID[] datanodeids = rBlock.getLocations();
+DatanodeInfo[] targets = rBlock.getLocations();
+DatanodeID[] datanodeids = (DatanodeID[])targets;
{code}

Updated patch, includes the comments as well so it's clear both classes are 
using the same equals method.
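A minimal sketch of the relationship described above, with hypothetical stand-in classes that only mirror (and are not taken from) DatanodeID, DatanodeInfo, and DatanodeRegistration: two sibling subclasses inherit one equals(Object) from the common parent, so a cross-type comparison is still well-defined despite the findbugs heuristic.

```java
public class SharedEqualsSketch {
    // Stand-in for DatanodeID: defines the one equals used by both subclasses.
    static class NodeId {
        final String id;
        NodeId(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof NodeId && ((NodeId) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    // Stand-ins for DatanodeInfo and DatanodeRegistration: neither
    // overrides equals, so both inherit NodeId's implementation.
    static class Info extends NodeId { Info(String id) { super(id); } }
    static class Registration extends NodeId { Registration(String id) { super(id); } }

    public static void main(String[] args) {
        // Different concrete classes, but the comparison is meaningful
        // because both sides resolve to NodeId.equals.
        System.out.println(new Info("dn1").equals(new Registration("dn1"))); // true
    }
}
```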

 Fix DataNode.recoverBlock findbugs warning
 --

 Key: HDFS-3837
 URL: https://issues.apache.org/jira/browse/HDFS-3837
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Eli Collins
 Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt


 HDFS-2686 introduced the following findbugs warning:
 {noformat}
 Call to equals() comparing different types in 
 org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
 {noformat}
 Both are using DatanodeID#equals but it's a different method because 
 DNR#equals overrides equals for some reason (doesn't change behavior).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes

2012-08-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443245#comment-13443245
 ] 

Ted Yu commented on HDFS-3791:
--

Currently, a small deletion is determined by the constant BLOCK_DELETION_INCREMENT:
{code}
+  deleteNow = collectedBlocks.size() >= BLOCK_DELETION_INCREMENT;
{code}
I wonder if there is a use case where the increment should be configurable.

 Backport HDFS-173 to Branch-1 :  Recursively deleting a directory with 
 millions of files makes NameNode unresponsive for other commands until the 
 deletion completes
 

 Key: HDFS-3791
 URL: https://issues.apache.org/jira/browse/HDFS-3791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 1.0.0
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
 Fix For: 1.2.0

 Attachments: HDFS-3791.patch, HDFS-3791.patch, HDFS-3791.patch


 Backport HDFS-173. 
 see the 
 [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007]
  for more details

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3852) TestHftpDelegationToken is broken after HADOOP-8225

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443264#comment-13443264
 ] 

Hadoop QA commented on HDFS-3852:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542779/HDFS-3852.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3107//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3107//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3107//console

This message is automatically generated.

 TestHftpDelegationToken is broken after HADOOP-8225
 ---

 Key: HDFS-3852
 URL: https://issues.apache.org/jira/browse/HDFS-3852
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, security
Affects Versions: 0.23.3, 2.1.0-alpha
Reporter: Aaron T. Myers
Assignee: Daryn Sharp
 Attachments: HDFS-3852.patch


 It's been failing in all builds for the last 2 days or so. Git bisect 
 indicates that it's due to HADOOP-8225.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Kihwal Lee (JIRA)
Kihwal Lee created HDFS-3861:


 Summary: Deadlock in DFSClient
 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha


The deadlock is between DFSOutputStream#close() and DFSClient#close().



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443269#comment-13443269
 ] 

Kihwal Lee commented on HDFS-3861:
--

DFSClient#getLeaseRenewer() doesn't have to be synchronized since 
LeaseManager.Factory methods are synchronized. Multiple callers are still 
guaranteed to get a single live renewer back.


{noformat}
Java stack information for the threads listed above:
===
Thread-28:
at
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1729)
- waiting to lock 0xb5a05dc8 (a
org.apache.hadoop.hdfs.DFSOutputStream)
at
org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:674)
at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:691)
- locked 0xb5a06ed8 (a org.apache.hadoop.hdfs.DFSClient)
at
org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:539)
at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2386)
- locked 0xb44b00e8 (a org.apache.hadoop.fs.FileSystem$Cache)
at
org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2403)
- locked 0xb44b0100 (a
org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer)
at
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Thread-1175:
at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:538)
- waiting to lock 0xb5a06ed8 (a org.apache.hadoop.hdfs.DFSClient)
at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:550)
at
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1757)
- locked 0xb5a05dc8 (a org.apache.hadoop.hdfs.DFSOutputStream)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:66)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:99)
at
org.apache.hadoop.hdfs.TestDatanodeDeath$Workload.run(TestDatanodeDeath.java:101)
{noformat}
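The two stacks above acquire the DFSClient and DFSOutputStream monitors in opposite orders, which is the classic deadlock cycle. As a sketch only (the actual fix, per the comment above, is to drop the unnecessary synchronization from getLeaseRenewer() rather than to reorder locks), here is the general remedy of imposing a single global lock order, using hypothetical stand-in locks:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {
    static final ReentrantLock clientLock = new ReentrantLock();  // ~ DFSClient
    static final ReentrantLock streamLock = new ReentrantLock();  // ~ DFSOutputStream

    // DFSClient#close path: client lock first, then each stream's lock.
    static void clientClose() {
        clientLock.lock();
        try {
            streamLock.lock();
            try { /* close the stream */ } finally { streamLock.unlock(); }
        } finally { clientLock.unlock(); }
    }

    // DFSOutputStream#close path, reordered to also take the client lock
    // first. (In the stack trace it took the stream lock first, which is
    // what closed the cycle.)
    static void streamClose() {
        clientLock.lock();
        try {
            streamLock.lock();
            try { /* endFileLease */ } finally { streamLock.unlock(); }
        } finally { clientLock.unlock(); }
    }

    // Runs both paths concurrently; with one global order the cycle
    // cannot form, so both threads terminate.
    static boolean runBoth() throws InterruptedException {
        Thread a = new Thread(LockOrderSketch::clientClose);
        Thread b = new Thread(LockOrderSketch::streamClose);
        a.start(); b.start();
        a.join(5000); b.join(5000);
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBoth() ? "no deadlock" : "deadlock");
    }
}
```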

 Deadlock in DFSClient
 -

 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha

 Attachments: hdfs-3861.patch.txt


 The deadlock is between DFSOutputStream#close() and DFSClient#close().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443271#comment-13443271
 ] 

Aaron T. Myers commented on HDFS-3860:
--

Oof, good catch, Jing. Fortunately this case seems like it would be pretty 
tough to hit: since HeartbeatManager#heartbeatCheck returns early when the NN 
is already in safemode, the NN would have to enter safemode in a very short 
window of time to hit this. Still certainly worth fixing, though.

The patch looks good to me. The findbugs warning is unrelated and 
TestHftpDelegationToken is known to currently be failing.

+1, I'll commit this momentarily.

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck 
 safemode. If it is in safemode, the monitor thread will return from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.
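The fix pattern for a bug of this shape is the standard try/finally release, sketched here with a hypothetical stand-in for the namesystem lock (an illustration of the pattern, not the HDFS-3860 patch itself):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HeartbeatCheckSketch {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    private final boolean inSafeMode;

    HeartbeatCheckSketch(boolean inSafeMode) { this.inSafeMode = inSafeMode; }

    // Returns true if a dead-node removal was performed. The try/finally
    // guarantees the write lock is released even on the safemode early
    // return, which is exactly the path the buggy code missed.
    boolean heartbeatCheck() {
        fsLock.writeLock().lock();
        try {
            if (inSafeMode) {
                return false; // early return -- lock still released below
            }
            // ... remove the dead datanode from the heartbeat list ...
            return true;
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    int writeHoldCount() {
        return fsLock.writeLock().getHoldCount();
    }

    public static void main(String[] args) {
        HeartbeatCheckSketch s = new HeartbeatCheckSketch(true);
        System.out.println(s.heartbeatCheck()); // false (safemode path)
        System.out.println(s.writeHoldCount()); // 0 -- lock was released anyway
    }
}
```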

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-3861:
-

Attachment: hdfs-3861.patch.txt

 Deadlock in DFSClient
 -

 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha

 Attachments: hdfs-3861.patch.txt


 The deadlock is between DFSOutputStream#close() and DFSClient#close().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-3861:
-

Status: Patch Available  (was: Open)

 Deadlock in DFSClient
 -

 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha

 Attachments: hdfs-3861.patch.txt


 The deadlock is between DFSOutputStream#close() and DFSClient#close().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3860:
-

   Resolution: Fixed
Fix Version/s: 2.2.0-alpha
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2.

Thanks a lot for the contribution, Jing.

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck 
 safemode. If it is in safemode, the monitor thread will return from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning

2012-08-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443286#comment-13443286
 ] 

Suresh Srinivas commented on HDFS-3837:
---

If this is a findbugs issue, why not just add this to findbugs exclude?

 Fix DataNode.recoverBlock findbugs warning
 --

 Key: HDFS-3837
 URL: https://issues.apache.org/jira/browse/HDFS-3837
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Eli Collins
 Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt


 HDFS-2686 introduced the following findbugs warning:
 {noformat}
 Call to equals() comparing different types in 
 org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
 {noformat}
 Both are using DatanodeID#equals but it's a different method because 
 DNR#equals overrides equals for some reason (doesn't change behavior).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443289#comment-13443289
 ] 

Suresh Srinivas commented on HDFS-3860:
---

Thanks Aaron for committing the patch.

bq. BTW could you please also ensure that this pattern of code is not repeated 
in any other places.
Going back to my previous comment, Jing: if possible, can you also see if there 
are other such issues?

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck 
 safemode. If it is in safemode, the monitor thread will return from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443292#comment-13443292
 ] 

Jing Zhao commented on HDFS-3860:
-

I just checked all the invocations of namesystem#writeLock / writeUnlock, and 
did not find similar problems. I will check other similar code too.

 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if some dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck 
 safemode. If it is in safemode, the monitor thread will return from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3791) Backport HDFS-173 to Branch-1 : Recursively deleting a directory with millions of files makes NameNode unresponsive for other commands until the deletion completes

2012-08-28 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443296#comment-13443296
 ] 

Suresh Srinivas commented on HDFS-3791:
---

When I added this in trunk, I was not sure if there is a use case. The whole 
idea was to give up the lock after deleting some number of blocks, so the 
number currently is arbitrary.
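The batching idea described above can be sketched as follows; the constant's value and the surrounding names are stand-ins (the real constant lives in the namesystem code), and the lock release is only simulated by a counter:

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalDeleteSketch {
    // Hypothetical value; the point is the threshold, not the number.
    static final int BLOCK_DELETION_INCREMENT = 1000;

    // Collects blocks for a large delete and "releases the lock" after each
    // full batch. Returns how many times the lock was given up, i.e. how
    // many windows other namesystem operations got to run in.
    static int deleteBlocks(int totalBlocks) {
        List<Integer> collectedBlocks = new ArrayList<>();
        int lockReleases = 0;
        for (int b = 0; b < totalBlocks; b++) {
            collectedBlocks.add(b);
            boolean deleteNow = collectedBlocks.size() >= BLOCK_DELETION_INCREMENT;
            if (deleteNow) {
                collectedBlocks.clear(); // delete this batch of blocks...
                lockReleases++;          // ...then drop and re-take the lock
            }
        }
        // Any remainder would be deleted in one final pass under the lock.
        return lockReleases;
    }

    public static void main(String[] args) {
        System.out.println(deleteBlocks(2500)); // 2 full batches of 1000
    }
}
```

Without the threshold, a delete of millions of blocks holds the lock for the entire traversal, which is exactly the unresponsiveness this backport addresses.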

 Backport HDFS-173 to Branch-1 :  Recursively deleting a directory with 
 millions of files makes NameNode unresponsive for other commands until the 
 deletion completes
 

 Key: HDFS-3791
 URL: https://issues.apache.org/jira/browse/HDFS-3791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 1.0.0
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
 Fix For: 1.2.0

 Attachments: HDFS-3791.patch, HDFS-3791.patch, HDFS-3791.patch


 Backport HDFS-173. 
 see the 
 [comment|https://issues.apache.org/jira/browse/HDFS-2815?focusedCommentId=13422007&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422007]
  for more details

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee reassigned HDFS-3861:


Assignee: Kihwal Lee

 Deadlock in DFSClient
 -

 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha

 Attachments: hdfs-3861.patch.txt


 The deadlock is between DFSOutputStream#close() and DFSClient#close().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-2815) Namenode is not coming out of safemode when we perform ( NN crash + restart ) . Also FSCK report shows blocks missed.

2012-08-28 Thread Uma Maheswara Rao G (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G updated HDFS-2815:
--

Attachment: HDFS-2815-branch-1.patch

 Namenode is not coming out of safemode when we perform ( NN crash + restart ) 
 .  Also FSCK report shows blocks missed.
 --

 Key: HDFS-2815
 URL: https://issues.apache.org/jira/browse/HDFS-2815
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.22.0, 0.24.0, 0.23.1, 1.0.0, 1.1.0
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
Priority: Critical
 Fix For: 2.0.0-alpha, 3.0.0

 Attachments: HDFS-2815-22-branch.patch, HDFS-2815-branch-1.patch, 
 HDFS-2815-Branch-1.patch, HDFS-2815.patch, HDFS-2815.patch


 When testing HA (internal) with continuous switches at roughly 5-minute 
 intervals, we found some *blocks missed*, and the namenode went into safemode 
 after the next switch.
After analysis, I found that these files had already been deleted by 
 clients, yet I don't see any delete command logs in the namenode log files. 
 Still, the namenode added those blocks to invalidateSets and the DNs deleted 
 the blocks.
On restart, the namenode went into safemode, expecting more blocks before 
 it could leave safemode.
The reason could be that the file was deleted in memory and added into 
 invalidates, and only after that did the NN try to sync the edits into the 
 editlog file. By that time the NN had already asked the DNs to delete those 
 blocks, and the namenode then shut down before persisting to the editlog 
 (log behind).
For this reason we may not get the INFO logs about the delete, and when we 
 restart the Namenode (in my scenario it is again a switch), the Namenode 
 expects these deleted blocks as well, since the delete request was never 
 persisted into the editlog.
I reproduced this scenario with debug points. *I feel we should not add 
 the blocks to invalidates before persisting into the editlog*. 
 Note: for the switch, we used kill -9 (force kill).
   I am currently on version 0.20.2. The same was verified in 0.23 as well in a 
 normal crash + restart scenario.
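The proposed ordering (persist the delete before adding blocks to invalidates) can be sketched with hypothetical stand-ins for the edit log and the invalidate set; this illustrates the proposed ordering, not the committed patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DeleteOrderingSketch {
    final List<String> editLog = new ArrayList<>();          // durable record
    final Set<String> invalidateSet = new LinkedHashSet<>(); // blocks DNs may delete

    // Safe ordering: make the delete durable first, then expose the blocks
    // for invalidation. In the buggy ordering these two steps were reversed,
    // so a crash in between lost the logged delete while the datanodes went
    // ahead and deleted the blocks.
    void deleteFile(String file, List<String> blocks) {
        editLog.add("DELETE " + file);   // i.e. logSync before side effects
        invalidateSet.addAll(blocks);    // only now may DNs remove the blocks
    }

    public static void main(String[] args) {
        DeleteOrderingSketch ns = new DeleteOrderingSketch();
        ns.deleteFile("/f", Arrays.asList("blk_1", "blk_2"));
        System.out.println(ns.editLog);       // [DELETE /f]
        System.out.println(ns.invalidateSet); // [blk_1, blk_2]
    }
}
```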
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3373) FileContext HDFS implementation can leak socket caches

2012-08-28 Thread John George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John George updated HDFS-3373:
--

Status: Open  (was: Patch Available)

 FileContext HDFS implementation can leak socket caches
 --

 Key: HDFS-3373
 URL: https://issues.apache.org/jira/browse/HDFS-3373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 2.0.0-alpha, 3.0.0
Reporter: Todd Lipcon
Assignee: John George
 Attachments: HDFS-3373.branch-23.patch, HDFS-3373.trunk.patch


 As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, 
 and thus never calls DFSClient.close(). This means that, until finalizers 
 run, DFSClient will hold on to its SocketCache object and potentially have a 
 lot of outstanding sockets/fds held on to.
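The leak mechanism described above can be sketched with a hypothetical stand-in for the socket cache (not the real SocketCache or FileContext API): without an explicit close(), nothing drains the cache until finalization, so cached descriptors stay open indefinitely.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SocketCacheSketch implements AutoCloseable {
    // Hypothetical stand-in for DFSClient's SocketCache: cached entries
    // stay open until something explicitly drains them.
    private final Deque<AutoCloseable> cachedSockets = new ArrayDeque<>();

    void cache(AutoCloseable socket) { cachedSockets.push(socket); }

    int cachedCount() { return cachedSockets.size(); }

    // Without an explicit close() (the method FileContext lacks), this
    // cleanup would only ever run from a finalizer, leaving fds open
    // for an unbounded time.
    @Override
    public void close() {
        while (!cachedSockets.isEmpty()) {
            try { cachedSockets.pop().close(); } catch (Exception ignored) { }
        }
    }

    public static void main(String[] args) {
        SocketCacheSketch cache = new SocketCacheSketch();
        cache.cache(() -> {});
        System.out.println(cache.cachedCount()); // 1
        cache.close();
        System.out.println(cache.cachedCount()); // 0
    }
}
```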

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3373) FileContext HDFS implementation can leak socket caches

2012-08-28 Thread John George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John George updated HDFS-3373:
--

Attachment: HDFS-3373.trunk.patch.1

The TestConnCache failure is related to this JIRA. I had moved 
testDisableCache() from that test to another test file because it is no longer 
possible to change the cache config per DFS.

TestHftpDelegationToken is unrelated to this patch and has been failing in 
other builds as well.

Attaching a patch with testDisableCache() moved from TestConnCache to a new 
file.

 FileContext HDFS implementation can leak socket caches
 --

 Key: HDFS-3373
 URL: https://issues.apache.org/jira/browse/HDFS-3373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 2.0.0-alpha, 3.0.0
Reporter: Todd Lipcon
Assignee: John George
 Attachments: HDFS-3373.branch-23.patch, HDFS-3373.trunk.patch, 
 HDFS-3373.trunk.patch.1


 As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, 
 and thus never calls DFSClient.close(). This means that, until finalizers 
 run, DFSClient will hold on to its SocketCache object and potentially have a 
 lot of outstanding sockets/fds held on to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3373) FileContext HDFS implementation can leak socket caches

2012-08-28 Thread John George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John George updated HDFS-3373:
--

Status: Patch Available  (was: Open)

 FileContext HDFS implementation can leak socket caches
 --

 Key: HDFS-3373
 URL: https://issues.apache.org/jira/browse/HDFS-3373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 2.0.0-alpha, 3.0.0
Reporter: Todd Lipcon
Assignee: John George
 Attachments: HDFS-3373.branch-23.patch, HDFS-3373.trunk.patch, 
 HDFS-3373.trunk.patch.1


 As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, 
 and thus never calls DFSClient.close(). This means that, until finalizers 
 run, DFSClient will hold on to its SocketCache object and potentially have a 
 lot of outstanding sockets/fds held on to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3004) Implement Recovery Mode

2012-08-28 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3004:
---

Attachment: recovery-mode.pdf

Here is an updated Recovery Mode design document.

 Implement Recovery Mode
 ---

 Key: HDFS-3004
 URL: https://issues.apache.org/jira/browse/HDFS-3004
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: tools
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: 2.0.0-alpha

 Attachments: HDFS-3004.010.patch, HDFS-3004.011.patch, 
 HDFS-3004.012.patch, HDFS-3004.013.patch, HDFS-3004.015.patch, 
 HDFS-3004.016.patch, HDFS-3004.017.patch, HDFS-3004.018.patch, 
 HDFS-3004.019.patch, HDFS-3004.020.patch, HDFS-3004.022.patch, 
 HDFS-3004.023.patch, HDFS-3004.024.patch, HDFS-3004.026.patch, 
 HDFS-3004.027.patch, HDFS-3004.029.patch, HDFS-3004.030.patch, 
 HDFS-3004.031.patch, HDFS-3004.032.patch, HDFS-3004.033.patch, 
 HDFS-3004.034.patch, HDFS-3004.035.patch, HDFS-3004.036.patch, 
 HDFS-3004.037.patch, HDFS-3004.038.patch, HDFS-3004.039.patch, 
 HDFS-3004.040.patch, HDFS-3004.041.patch, HDFS-3004.042.patch, 
 HDFS-3004.042.patch, HDFS-3004.042.patch, HDFS-3004.043.patch, 
 HDFS-3004__namenode_recovery_tool.txt, recovery-mode.pdf


 When the NameNode metadata is corrupt for some reason, we want to be able to 
 fix it.  Obviously, we would prefer never to get into this state.  In a perfect 
 world, we never would.  However, bad data on disk can happen from time to 
 time, because of hardware errors or misconfigurations.  In the past we have 
 had to correct it manually, which is time-consuming and which can result in 
 downtime.
 Recovery mode is initiated by the system administrator.  When the NameNode 
 starts up in Recovery Mode, it will try to load the FSImage file, apply all 
 the edits from the edits log, and then write out a new image.  Then it will 
 shut down.
 Unlike in the normal startup process, the recovery mode startup process will 
 be interactive.  When the NameNode finds something that is inconsistent, it 
 will prompt the operator as to what it should do.   The operator can also 
 choose to take the first option for all prompts by starting up with the '-f' 
 flag, or typing 'a' at one of the prompts.
 I have reused as much code as possible from the NameNode in this tool.  
 Hopefully, the effort that was spent developing this will also make the 
 NameNode editLog and image processing even more robust than it already is.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443338#comment-13443338
 ] 

Hudson commented on HDFS-3860:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2680 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2680/])
HDFS-3860. HeartbeatManager#Monitor may wrongly hold the writelock of 
namesystem. Contributed by Jing Zhao. (Revision 1378228)

 Result = FAILURE
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378228
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java


 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck the 
 safemode. If it is in safemode, the monitor thread will return from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.
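The bug shape described above, and its standard fix, can be sketched outside Hadoop. The class below is illustrative only (the method names and fields are invented, not the HeartbeatManager code); it shows why an early return must sit inside a try/finally that releases the write lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of a write-lock leak and its try/finally fix.
public class LockLeakSketch {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    boolean inSafeMode = true;

    // Buggy shape: the early return leaks the write lock forever.
    void heartbeatCheckBuggy() {
        lock.writeLock().lock();
        if (inSafeMode) {
            return;               // BUG: write lock is never released
        }
        lock.writeLock().unlock();
    }

    // Fixed shape: unlock in finally, so every exit path releases the lock.
    void heartbeatCheckFixed() {
        lock.writeLock().lock();
        try {
            if (inSafeMode) {
                return;           // safe: the finally block still runs
            }
            // ... remove dead datanodes here ...
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockLeakSketch s = new LockLeakSketch();
        s.heartbeatCheckFixed();
        // After the fixed method returns, this thread holds no write lock.
        System.out.println(s.lock.writeLock().isHeldByCurrentThread()); // false
    }
}
```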

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443351#comment-13443351
 ] 

Hadoop QA commented on HDFS-3861:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542787/hdfs-3861.patch.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken
  
org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3109//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3109//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3109//console

This message is automatically generated.

 Deadlock in DFSClient
 -

 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha

 Attachments: hdfs-3861.patch.txt


 The deadlock is between DFSOutputStream#close() and DFSClient#close().
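The report is terse, but the general shape of such a deadlock, two objects whose synchronized close() methods call into each other, can be sketched as below. This is illustrative code, not the DFSClient implementation: thread A closing the client takes the client lock and then wants the stream lock, while thread B closing the stream takes the stream lock and then wants the client lock.

```java
// Illustrative lock-inversion sketch (not Hadoop code).
public class CloseDeadlockSketch {
    static class Client {
        final Stream stream = new Stream(this);
        boolean removed;
        // Takes the client lock, then (via stream.close) the stream lock.
        synchronized void close() { stream.close(); }
        // Entered while the caller may hold the stream lock.
        synchronized void remove(Stream s) { removed = true; }
    }

    static class Stream {
        final Client client;
        Stream(Client c) { client = c; }
        // Takes the stream lock, then (via client.remove) the client lock.
        synchronized void close() { client.remove(this); }
    }

    public static void main(String[] args) {
        // Single-threaded close completes (Java monitors are reentrant); the
        // hang needs two threads entering close() from opposite ends at once.
        Client c = new Client();
        c.close();
        System.out.println("closed: " + c.removed); // closed: true
    }
}
```

The usual fix pattern is to stop calling into the peer object while holding your own monitor, e.g. by copying state out under the lock and doing the cross-object call afterwards.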

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443353#comment-13443353
 ] 

Hudson commented on HDFS-3860:
--

Integrated in Hadoop-Common-trunk-Commit #2651 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2651/])
HDFS-3860. HeartbeatManager#Monitor may wrongly hold the writelock of 
namesystem. Contributed by Jing Zhao. (Revision 1378228)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378228
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java


 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck the 
 safemode. If it is in safemode, the monitor thread will return from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3860) HeartbeatManager#Monitor may wrongly hold the writelock of namesystem

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443367#comment-13443367
 ] 

Hudson commented on HDFS-3860:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2715 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2715/])
HDFS-3860. HeartbeatManager#Monitor may wrongly hold the writelock of 
namesystem. Contributed by Jing Zhao. (Revision 1378228)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378228
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java


 HeartbeatManager#Monitor may wrongly hold the writelock of namesystem
 -

 Key: HDFS-3860
 URL: https://issues.apache.org/jira/browse/HDFS-3860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3860.patch, HDFS-heartbeat-testcase.patch


 In HeartbeatManager#heartbeatCheck, if a dead datanode is found, the 
 monitor thread will acquire the write lock of the namesystem and recheck the 
 safemode. If it is in safemode, the monitor thread will return from the 
 heartbeatCheck function without releasing the write lock. This may cause the 
 monitor thread to wrongly hold the write lock forever.
 The attached test case tries to simulate this bad scenario.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3540) Further improvement on recovery mode and edit log toleration in branch-1

2012-08-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443375#comment-13443375
 ] 

Colin Patrick McCabe commented on HDFS-3540:


Hi Nicholas,

Your summary seems reasonable to me overall.  I agree with you that the 
recommended setting for edit log toleration should be disabled.  Is there 
anything left to do for this JIRA?

 Further improvement on recovery mode and edit log toleration in branch-1
 

 Key: HDFS-3540
 URL: https://issues.apache.org/jira/browse/HDFS-3540
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 1.2.0
Reporter: Tsz Wo (Nicholas), SZE
Assignee: Tsz Wo (Nicholas), SZE

 *Recovery Mode*: HDFS-3479 backported HDFS-3335 to branch-1.  However, the 
 recovery mode feature in branch-1 is dramatically different from the recovery 
 mode in trunk since the edit log implementations in these two branch are 
 different.  For example, there is UNCHECKED_REGION_LENGTH in branch-1 but not 
 in trunk.
 *Edit Log Toleration*: HDFS-3521 added this feature to branch-1 to remedy 
 UNCHECKED_REGION_LENGTH and to tolerate edit log corruption.
 There are overlaps between these two features.  We study potential further 
 improvement in this issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0

2012-08-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443377#comment-13443377
 ] 

Colin Patrick McCabe commented on HDFS-3731:


bq. Any update on branch-0.23? Do you want me to look into it?

There are some differences in the branch-0.23 BlockManager state machine, such 
that a straight port of the patch doesn't work.  The easiest thing to do would 
probably be to backport some of the BlockManager fixes and improvements to 
branch-0.23.  It would be great if you could look into that.

 2.0 release upgrade must handle blocks being written from 1.0
 -

 Key: HDFS-3731
 URL: https://issues.apache.org/jira/browse/HDFS-3731
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Suresh Srinivas
Assignee: Colin Patrick McCabe
Priority: Blocker
 Fix For: 2.2.0-alpha

 Attachments: hadoop1-bbw.tgz, HDFS-3731.002.patch, HDFS-3731.003.patch


 Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 
 release. Problem reported by Brahma Reddy.
 The {{DataNode}} will only have one block pool after upgrading from a 1.x 
 release.  (This is because in the 1.x releases, there were no block pools-- 
 or equivalently, everything was in the same block pool).  During the upgrade, 
 we should hardlink the block files from the {{blocksBeingWritten}} directory 
 into the {{rbw}} directory of this block pool.  Similarly, on {{-finalize}}, 
 we should delete the {{blocksBeingWritten}} directory.
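The hardlink step described above can be sketched with NIO. The directory names follow the description; this is illustrative code, not the actual DataNode upgrade path.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: hard-link every file under blocksBeingWritten/
// into the single block pool's rbw/ directory during upgrade.
public class BbwUpgradeSketch {
    static void linkBlocks(Path bbwDir, Path rbwDir) throws Exception {
        Files.createDirectories(rbwDir);
        try (var files = Files.list(bbwDir)) {
            for (Path src : (Iterable<Path>) files::iterator) {
                // A hard link preserves the block data without copying it,
                // and the original can be removed later on -finalize.
                Files.createLink(rbwDir.resolve(src.getFileName()), src);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path bbw = Files.createTempDirectory("bbw");
        Path rbw = bbw.resolveSibling(bbw.getFileName() + "-rbw");
        Files.write(bbw.resolve("blk_1"), new byte[]{1, 2, 3});
        linkBlocks(bbw, rbw);
        System.out.println(Files.exists(rbw.resolve("blk_1"))); // true
    }
}
```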

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443387#comment-13443387
 ] 

Hadoop QA commented on HDFS-3837:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542780/hdfs-3837.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken
  
org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3108//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3108//console

This message is automatically generated.

 Fix DataNode.recoverBlock findbugs warning
 --

 Key: HDFS-3837
 URL: https://issues.apache.org/jira/browse/HDFS-3837
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Eli Collins
 Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt


 HDFS-2686 introduced the following findbugs warning:
 {noformat}
 Call to equals() comparing different types in 
 org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
 {noformat}
 Both are using DatanodeID#equals but it's a different method because 
 DNR#equals overrides equals for some reason (doesn't change behavior).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-2815) Namenode is not coming out of safemode when we perform ( NN crash + restart ) . Also FSCK report shows blocks missed.

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443394#comment-13443394
 ] 

Hadoop QA commented on HDFS-2815:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12542794/HDFS-2815-branch-1.patch
  against trunk revision .

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3111//console

This message is automatically generated.

 Namenode is not coming out of safemode when we perform ( NN crash + restart ) 
 .  Also FSCK report shows blocks missed.
 --

 Key: HDFS-2815
 URL: https://issues.apache.org/jira/browse/HDFS-2815
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.22.0, 0.24.0, 0.23.1, 1.0.0, 1.1.0
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
Priority: Critical
 Fix For: 2.0.0-alpha, 3.0.0

 Attachments: HDFS-2815-22-branch.patch, HDFS-2815-branch-1.patch, 
 HDFS-2815-Branch-1.patch, HDFS-2815.patch, HDFS-2815.patch


 When testing HA (internal) with continuous switchovers at roughly 5-minute 
 intervals, we found some *missing blocks*, and the namenode went into safemode 
 after the next switchover.

After analysis, I found that these files had already been deleted by clients, 
 but I do not see any delete commands in the namenode log files. The namenode 
 nevertheless added those blocks to the invalidateSets, and the DNs deleted the 
 blocks.
On restart, the namenode went into safemode, expecting more blocks before it 
 could come out of safemode.
The likely reason is that the file is deleted in memory and its blocks added 
 to the invalidates set, and only afterwards does the NN try to sync the edits 
 into the editlog file. By that time the NN has already asked the DNs to delete 
 those blocks. The namenode then shuts down before persisting to the editlog 
 (the log lags behind).
For this reason we may not get the INFO logs about the delete, and when we 
 restart the namenode (in my scenario it is again a switchover), the namenode 
 still expects those deleted blocks, because the delete request was not 
 persisted into the editlog.
I reproduced this scenario with debug points. *I feel we should not add the 
 blocks to the invalidates set before persisting into the editlog*. 
 Note: for the switchover, we used kill -9 (force kill).
   I am currently on version 0.20.2. The same was verified on 0.23 as well, in 
 a normal crash + restart scenario.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443401#comment-13443401
 ] 

Kihwal Lee commented on HDFS-3861:
--

- The test failures are not related to this patch.
- No test was added. Existing test case exposed this bug (TestDataNodeDeath).
- The findbugs warning is not caused by this patch.

 Deadlock in DFSClient
 -

 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha

 Attachments: hdfs-3861.patch.txt


 The deadlock is between DFSOutputStream#close() and DFSClient#close().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.

2012-08-28 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3849:
---

Attachment: HDFS-3849.003.patch

* don't set DT config

 When re-loading the FSImage, we should clear the existing genStamp and leases.
 --

 Key: HDFS-3849
 URL: https://issues.apache.org/jira/browse/HDFS-3849
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Critical
 Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, 
 HDFS-3849.003.patch


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 This is an issue in the 2NN, because it sometimes clears the existing FSImage 
 and reloads a new one in order to get back in sync with the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.

2012-08-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443445#comment-13443445
 ] 

Aaron T. Myers commented on HDFS-3849:
--

+1 pending Jenkins.

 When re-loading the FSImage, we should clear the existing genStamp and leases.
 --

 Key: HDFS-3849
 URL: https://issues.apache.org/jira/browse/HDFS-3849
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Critical
 Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, 
 HDFS-3849.003.patch


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 This is an issue in the 2NN, because it sometimes clears the existing FSImage 
 and reloads a new one in order to get back in sync with the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3861) Deadlock in DFSClient

2012-08-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443463#comment-13443463
 ] 

Colin Patrick McCabe commented on HDFS-3861:


Looks good to me.

 Deadlock in DFSClient
 -

 Key: HDFS-3861
 URL: https://issues.apache.org/jira/browse/HDFS-3861
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.2.0-alpha
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 0.23.4, 3.0.0, 2.2.0-alpha

 Attachments: hdfs-3861.patch.txt


 The deadlock is between DFSOutputStream#close() and DFSClient#close().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3859) QJM: implement md5sum verification

2012-08-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443476#comment-13443476
 ] 

Steve Loughran commented on HDFS-3859:
--

Isn't MD5 overkill? Can't a good CRC (like TCP Jumbo Frames uses) suffice?

 QJM: implement md5sum verification
 --

 Key: HDFS-3859
 URL: https://issues.apache.org/jira/browse/HDFS-3859
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: QuorumJournalManager (HDFS-3077)
Reporter: Todd Lipcon
Assignee: Todd Lipcon

 When the QJM passes journal segments between nodes, it should use an md5sum 
 field to make sure the data doesn't get corrupted during transit. This also 
 serves as an extra safe-guard to make sure that the data is consistent across 
 all nodes when finalizing a segment.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3859) QJM: implement md5sum verification

2012-08-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443483#comment-13443483
 ] 

Todd Lipcon commented on HDFS-3859:
---

Sure, it's overkill, but it's not that expensive and we already have an 
implementation of it sitting around. It's also handy because md5sum is 
commonly available on the command line, and we use it for FSImages already as 
well. Performance-wise, my laptop can md5sum at about 500MB/sec, so given that 
log segments under recovery are likely to be much smaller than 500M, I don't 
think we should be concerned about that.
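For reference, the JDK already ships the MD5 implementation being discussed. A minimal sketch of hashing a segment's bytes, with hex output matching what the `md5sum` command prints, might look like this (illustrative only, not the QJM code):

```java
import java.security.MessageDigest;

// Illustrative sketch: MD5 over a byte array, hex-formatted like md5sum.
public class Md5Sketch {
    static String md5Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b)); // two lowercase hex digits per byte
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // MD5 of empty input is the well-known d41d8cd98f00b204e9800998ecf8427e.
        System.out.println(md5Hex(new byte[0]));
    }
}
```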

 QJM: implement md5sum verification
 --

 Key: HDFS-3859
 URL: https://issues.apache.org/jira/browse/HDFS-3859
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: QuorumJournalManager (HDFS-3077)
Reporter: Todd Lipcon
Assignee: Todd Lipcon

 When the QJM passes journal segments between nodes, it should use an md5sum 
 field to make sure the data doesn't get corrupted during transit. This also 
 serves as an extra safe-guard to make sure that the data is consistent across 
 all nodes when finalizing a segment.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-3862) QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics

2012-08-28 Thread Todd Lipcon (JIRA)
Todd Lipcon created HDFS-3862:
-

 Summary: QJM: don't require a fencer to be configured if shared 
storage has built-in single-writer semantics
 Key: HDFS-3862
 URL: https://issues.apache.org/jira/browse/HDFS-3862
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: QuorumJournalManager (HDFS-3077)
Reporter: Todd Lipcon


Currently, NN HA requires that the administrator configure a fencing method to 
ensure that only a single NameNode may write to the shared storage at a time. 
Some shared edits storage implementations (like QJM) inherently enforce 
single-writer semantics at the storage level, and thus the user should not be 
forced to specify one.

We should extend the JournalManager interface so that the HA code can operate 
without a configured fencer if the JM has such built-in fencing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3862) QJM: don't require a fencer to be configured if shared storage has built-in single-writer semantics

2012-08-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443518#comment-13443518
 ] 

Todd Lipcon commented on HDFS-3862:
---

I think this might be the case for BookKeeper as well. Any of the folks working 
on BKJM want to take this on? I anticipate we would add a simple API to 
JournalManager like: {{boolean isNativelySingleWriter();}} or {{boolean 
needsExternalFencing();}}. Then the failover code could check the shared 
storage dir to see if this is the case, and if so, not error out if the user 
doesn't specify a fence method.
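A sketch of that proposed interface change follows. The method and class names are hypothetical, taken from the comment above rather than any committed API; the point is only that the failover code checks the backend before insisting on a fencer.

```java
// Hypothetical sketch of the proposed JournalManager extension.
interface JournalManagerSketch {
    /** True if the storage needs an external fencer; quorum-style backends
     *  that enforce single-writer semantics themselves return false. */
    default boolean needsExternalFencing() { return true; }
}

class QuorumJournalSketch implements JournalManagerSketch {
    @Override
    public boolean needsExternalFencing() {
        return false; // fences old writers itself (e.g. via epoch numbers)
    }
}

public class FencingCheck {
    // Failover setup errors out only when the backend needs fencing
    // and the user has not configured a fence method.
    static boolean fencerRequired(JournalManagerSketch jm, boolean fencerConfigured) {
        return jm.needsExternalFencing() && !fencerConfigured;
    }

    public static void main(String[] args) {
        System.out.println(fencerRequired(new QuorumJournalSketch(), false)); // false
    }
}
```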

 QJM: don't require a fencer to be configured if shared storage has built-in 
 single-writer semantics
 ---

 Key: HDFS-3862
 URL: https://issues.apache.org/jira/browse/HDFS-3862
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: QuorumJournalManager (HDFS-3077)
Reporter: Todd Lipcon

 Currently, NN HA requires that the administrator configure a fencing method 
 to ensure that only a single NameNode may write to the shared storage at a 
 time. Some shared edits storage implementations (like QJM) inherently enforce 
 single-writer semantics at the storage level, and thus the user should not be 
 forced to specify one.
 We should extend the JournalManager interface so that the HA code can operate 
 without a configured fencer if the JM has such built-in fencing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3373) FileContext HDFS implementation can leak socket caches

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443524#comment-13443524
 ] 

Hadoop QA commented on HDFS-3373:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12542795/HDFS-3373.trunk.patch.1
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 2 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3110//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3110//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3110//console

This message is automatically generated.

 FileContext HDFS implementation can leak socket caches
 --

 Key: HDFS-3373
 URL: https://issues.apache.org/jira/browse/HDFS-3373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 2.0.0-alpha, 3.0.0
Reporter: Todd Lipcon
Assignee: John George
 Attachments: HDFS-3373.branch-23.patch, HDFS-3373.trunk.patch, 
 HDFS-3373.trunk.patch.1


 As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, 
 and thus never calls DFSClient.close(). This means that, until finalizers 
 run, DFSClient will hold on to its SocketCache object and potentially have a 
 lot of outstanding sockets/fds held on to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1490) TransferFSImage should timeout

2012-08-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443526#comment-13443526
 ] 

Todd Lipcon commented on HDFS-1490:
---

- I don't like reusing the ipc ping interval for this timeout here. It's from an 
entirely separate module, and I don't see why one should correlate to the 
other. Why not introduce a new config which defaults to something like 1 minute?
- In the test case, shouldn't you somehow notify the servlet to exit? Currently 
it waits on itself, but nothing notifies it. 
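The timeout Todd asks for can be sketched in a few lines. This is not the actual HDFS-1490 patch; it is a minimal illustration of putting explicit connect/read timeouts on the image-transfer HTTP connection, with a hypothetical one-minute default matching the suggestion above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch (not the actual HDFS-1490 patch): open the image-transfer
// connection with explicit timeouts so a crashed primary cannot hang the
// secondary forever. The one-minute default is hypothetical.
public class TimedTransferSketch {
  static final int DEFAULT_TIMEOUT_MS = 60_000; // assumed default: 1 minute

  public static HttpURLConnection open(URL imageUrl, int timeoutMs) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) imageUrl.openConnection();
    conn.setConnectTimeout(timeoutMs); // fail fast if the peer never accepts
    conn.setReadTimeout(timeoutMs);    // fail if the stream stalls mid-transfer
    return conn;
  }
}
```

With these set, a stalled read throws java.net.SocketTimeoutException instead of blocking forever, so the checkpoint loop can retry on its own.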


 TransferFSImage should timeout
 --

 Key: HDFS-1490
 URL: https://issues.apache.org/jira/browse/HDFS-1490
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Dmytro Molkov
Assignee: Dmytro Molkov
Priority: Minor
 Attachments: HDFS-1490.patch, HDFS-1490.patch


 Sometimes when primary crashes during image transfer secondary namenode would 
 hang trying to read the image from HTTP connection forever.
 It would be great to set timeouts on the connection so if something like that 
 happens there is no need to restart the secondary itself.
 In our case restarting components is handled by the set of scripts and since 
 the Secondary as the process is running it would just stay hung until we get 
 an alarm saying the checkpointing doesn't happen.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443541#comment-13443541
 ] 

Hadoop QA commented on HDFS-3849:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542806/HDFS-3849.003.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 2 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3112//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3112//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3112//console

This message is automatically generated.

 When re-loading the FSImage, we should clear the existing genStamp and leases.
 --

 Key: HDFS-3849
 URL: https://issues.apache.org/jira/browse/HDFS-3849
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Critical
 Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, 
 HDFS-3849.003.patch


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 This is an issue in the 2NN, because it sometimes clears the existing FSImage 
 and reloads a new one in order to get back in sync with the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-3863) QJM: track last committed txid

2012-08-28 Thread Todd Lipcon (JIRA)
Todd Lipcon created HDFS-3863:
-

 Summary: QJM: track last committed txid
 Key: HDFS-3863
 URL: https://issues.apache.org/jira/browse/HDFS-3863
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: QuorumJournalManager (HDFS-3077)
Reporter: Todd Lipcon
Assignee: Todd Lipcon


Per some discussion with [~stepinto] 
[here|https://issues.apache.org/jira/browse/HDFS-3077?focusedCommentId=13422579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422579],
 we should keep track of the last committed txid on each JournalNode. Then 
during any recovery operation, we can sanity-check that we aren't asked to 
truncate a log to an earlier transaction.

This is also a necessary step if we want to support reading from in-progress 
segments in the future (since we should only allow reads up to the commit point)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0

2012-08-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443551#comment-13443551
 ] 

Robert Joseph Evans commented on HDFS-3731:
---

Do you have a list of ones you know about?  If not I can start pulling on that 
thread tomorrow.

 2.0 release upgrade must handle blocks being written from 1.0
 -

 Key: HDFS-3731
 URL: https://issues.apache.org/jira/browse/HDFS-3731
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Suresh Srinivas
Assignee: Colin Patrick McCabe
Priority: Blocker
 Fix For: 2.2.0-alpha

 Attachments: hadoop1-bbw.tgz, HDFS-3731.002.patch, HDFS-3731.003.patch


 Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 
 release. Problem reported by Brahma Reddy.
 The {{DataNode}} will only have one block pool after upgrading from a 1.x 
 release.  (This is because in the 1.x releases, there were no block pools-- 
 or equivalently, everything was in the same block pool).  During the upgrade, 
 we should hardlink the block files from the {{blocksBeingWritten}} directory 
 into the {{rbw}} directory of this block pool.  Similarly, on {{-finalize}}, 
 we should delete the {{blocksBeingWritten}} directory.
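The hardlink step described above can be sketched as follows. This is a hypothetical illustration only, not the DataNode's actual upgrade code (which also handles meta files, generation stamps, and the {{-finalize}} cleanup); directory names follow the description.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the upgrade step described above: hard-link every
// block file from blocksBeingWritten/ into the single block pool's rbw/
// directory. Hard links let the old layout survive until -finalize
// deletes blocksBeingWritten/.
public class BbwUpgradeSketch {
  public static int linkBlocksBeingWritten(Path bbwDir, Path rbwDir) throws IOException {
    int linked = 0;
    Files.createDirectories(rbwDir);
    try (DirectoryStream<Path> blocks = Files.newDirectoryStream(bbwDir)) {
      for (Path block : blocks) {
        Files.createLink(rbwDir.resolve(block.getFileName()), block);
        linked++;
      }
    }
    return linked;
  }
}
```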

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3863) QJM: track last committed txid

2012-08-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443556#comment-13443556
 ] 

Todd Lipcon commented on HDFS-3863:
---

The design here is pretty simple, given the way our journaling protocol works. 
In particular, we only have one outstanding batch of transactions at once. We 
never send a batch of transactions beginning at txid N until the prior batch 
(up through N-1) has been accepted at a quorum of nodes. Thus, any 
{{sendEdits()}} call with {{firstTxId}} N implies a {{commit(N-1)}}.

So, my plan is as follows:

- Introduce a new file inside the journal directory called {{committed-txid}}. 
This would include a single numeric text line, similar to the {{seen_txid}} 
that the NameNode maintains.
- Since this whole feature is not required for correctness, we don't need to 
fsync this file on every update. Instead, we can let the operating system write 
it out to disk whenever it so chooses. If, after a system crash, it reverts to 
an earlier value, this is OK, since our recovery protocol doesn't depend on it 
being up-to-date in any way. Put another way, the invariant is that the file 
contains a value which is a lower bound on the latest committed txn.

The data would be updated whenever a sendEdits() call is made -- the call 
implicitly commits all edits prior to the current batch.

This alone is enough for a good sanity check. If we want to also support 
reading the committed transactions while in-progress, it's not quite sufficient 
-- the last batch of transactions will never be readable if the NN stops 
writing new batches for a protracted period of time. To solve this, we can add 
a timer thread to the client which periodically (eg once or twice a second) 
sends an RPC to update the committed-txid on all of the nodes. The periodic 
timer will also have the nice property of causing a NN which has been fenced to 
abort itself even if no write transactions are taking place.
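The lower-bound invariant above can be sketched as follows. Class and method names here are hypothetical, not the actual QJM code; the real JournalNode would also persist the value to the {{committed-txid}} file (without fsync, per the design).

```java
// Minimal sketch of the invariant described above: the tracked value is only
// ever a lower bound on the latest committed txid. Names are hypothetical.
public class CommittedTxidTracker {
  private long committedTxid = 0; // lower bound on the latest committed txn

  // sendEdits(firstTxId, ...) implies every txid up through firstTxId-1
  // has been accepted at a quorum, i.e. commit(firstTxId - 1).
  public void onSendEdits(long firstTxId) {
    updateLowerBound(firstTxId - 1);
  }

  // Periodic commit RPC from the client's timer thread.
  public void onCommitRpc(long txid) {
    updateLowerBound(txid);
  }

  private void updateLowerBound(long txid) {
    // Never move backwards: a stale value after a crash is acceptable,
    // but an overstated value would break the recovery sanity check.
    if (txid > committedTxid) {
      committedTxid = txid;
    }
  }

  public long getCommittedTxid() { return committedTxid; }
}
```

Recovery can then sanity-check that it is never asked to truncate a log below getCommittedTxid().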

 QJM: track last committed txid
 

 Key: HDFS-3863
 URL: https://issues.apache.org/jira/browse/HDFS-3863
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: QuorumJournalManager (HDFS-3077)
Reporter: Todd Lipcon
Assignee: Todd Lipcon

 Per some discussion with [~stepinto] 
 [here|https://issues.apache.org/jira/browse/HDFS-3077?focusedCommentId=13422579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13422579],
  we should keep track of the last committed txid on each JournalNode. Then 
 during any recovery operation, we can sanity-check that we aren't asked to 
 truncate a log to an earlier transaction.
 This is also a necessary step if we want to support reading from in-progress 
 segments in the future (since we should only allow reads up to the commit 
 point)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0

2012-08-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443577#comment-13443577
 ] 

Colin Patrick McCabe commented on HDFS-3731:


bq. Do you have a list of ones you know about? If not I can start pulling on 
that thread tomorrow.

Sorry, I just took a preliminary look, didn't have time to go in depth.

The state machine errors are pretty clear in the test.  You may need to wait a 
while for them to appear since surefire does a lot of buffering.

 2.0 release upgrade must handle blocks being written from 1.0
 -

 Key: HDFS-3731
 URL: https://issues.apache.org/jira/browse/HDFS-3731
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Suresh Srinivas
Assignee: Colin Patrick McCabe
Priority: Blocker
 Fix For: 2.2.0-alpha

 Attachments: hadoop1-bbw.tgz, HDFS-3731.002.patch, HDFS-3731.003.patch


 Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 
 release. Problem reported by Brahma Reddy.
 The {{DataNode}} will only have one block pool after upgrading from a 1.x 
 release.  (This is because in the 1.x releases, there were no block pools-- 
 or equivalently, everything was in the same block pool).  During the upgrade, 
 we should hardlink the block files from the {{blocksBeingWritten}} directory 
 into the {{rbw}} directory of this block pool.  Similarly, on {{-finalize}}, 
 we should delete the {{blocksBeingWritten}} directory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)
Aaron T. Myers created HDFS-3864:


 Summary: NN does not update internal file mtime for OP_CLOSE when 
reading from the edit log
 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers


When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
mtime and atime. However, when reading in an OP_CLOSE from the edit log, the NN 
does not apply these values to the in-memory FS data structure. Because of 
this, a file's mtime or atime may appear to go back in time after an NN 
restart, or an HA failover.

Most of the time this will be harmless and folks won't notice, but in the event 
one of these files is being used in the distributed cache of an MR job when an 
HA failover occurs, the job might notice that the mtime of a cache file has 
changed, which in MR2 will cause the job to fail with an exception like the 
following:

{noformat}
java.io.IOException: Resource 
hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar
 changed on src filesystem (expected 1342137814599, was 1342137814473
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{noformat}

Credit to Sujay Rau for discovering this issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3733) Audit logs should include WebHDFS access

2012-08-28 Thread Andy Isaacson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443584#comment-13443584
 ] 

Andy Isaacson commented on HDFS-3733:
-

OK, backing up -- I think my addition of CurClient just duplicates 
functionality already provided by NamenodeWebHdfsMethods#REMOTE_ADDRESS .  So I 
can drop that new ThreadLocal and just teach NameNodeRpcServer to use 
REMOTE_ADDRESS appropriately.

Or am I missing something?

bq. getRemoteIp should not just return NamenodeWebHdfsMethods#getRemoteAddress

(I assume you are referring to my newly added {{FSNamesystem#getRemoteIp}}.)

Agreed, FSNamesystem should support all remote methods: RPC, WebHdfs ... and 
Hftp?  The {{FSNamesystem#getRemoteIp}} should handle them all.

The helper {{NameNodeRpcServer#getRemoteIp}} implements the WebHdfs portion of 
{{FSNamesystem#getRemoteIp}} just as {{Server#getRemoteIp}} implements the RPC 
portion.
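The thread-local pattern under discussion can be sketched as below. These are not the actual Hadoop signatures -- the names only mirror the comment -- but they show how an audit-log helper could prefer the per-request WebHdfs address and fall back to the RPC layer's address.

```java
// Hypothetical sketch of the pattern under discussion: a per-thread remote
// address set by the WebHdfs servlet, consulted by the audit logger with a
// fallback to the RPC-supplied address. Not the actual Hadoop API.
public class RemoteIpSketch {
  static final ThreadLocal<String> WEBHDFS_REMOTE_ADDRESS = new ThreadLocal<>();

  // Audit-log helper: prefer the WebHdfs address set for this request
  // thread; otherwise fall back to the address the RPC server reports.
  public static String getRemoteIp(String rpcAddress) {
    String webhdfs = WEBHDFS_REMOTE_ADDRESS.get();
    return webhdfs != null ? webhdfs : rpcAddress;
  }
}
```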

 Audit logs should include WebHDFS access
 

 Key: HDFS-3733
 URL: https://issues.apache.org/jira/browse/HDFS-3733
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: webhdfs
Affects Versions: 2.0.0-alpha
Reporter: Andy Isaacson
Assignee: Andy Isaacson
 Attachments: hdfs-3733.txt


 Access via WebHdfs does not result in audit log entries.  It should.
 {noformat}
 % curl 'http://nn1:50070/webhdfs/v1/user/adi/hello.txt?op=GETFILESTATUS'
 {"FileStatus":{"accessTime":1343351432395,"blockSize":134217728,"group":"supergroup","length":12,"modificationTime":1342808158399,"owner":"adi","pathSuffix":"","permission":"644","replication":1,"type":"FILE"}}
 {noformat}
 and observe that no audit log entry is generated.
 Interestingly, OPEN requests do not generate audit log entries when the NN 
 generates the redirect, but do generate audit log entries when the second 
 phase against the DN is executed.
 {noformat}
 % curl -v 'http://nn1:50070/webhdfs/v1/user/adi/hello.txt?op=OPEN'
 ...
  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: 
 http://dn01:50075/webhdfs/v1/user/adi/hello.txt?op=OPEN&namenoderpcaddress=nn1:8020&offset=0
 ...
 % curl -v 
 'http://dn01:50075/webhdfs/v1/user/adi/hello.txt?op=OPEN&namenoderpcaddress=nn1:8020'
 ...
  HTTP/1.1 200 OK
  Content-Type: application/octet-stream
  Content-Length: 12
  Server: Jetty(6.1.26.cloudera.1)
  
 hello world
 {noformat}
 This happens because {{DatanodeWebHdfsMethods#get}} uses {{DFSClient#open}} 
 thereby triggering the existing {{logAuditEvent}} code.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Status: Patch Available  (was: Open)

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch


 When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
 mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
 NN does not apply these values to the in-memory FS data structure. Because of 
 this, a file's mtime or atime may appear to go back in time after an NN 
 restart, or an HA failover.
 Most of the time this will be harmless and folks won't notice, but in the 
 event one of these files is being used in the distributed cache of an MR job 
 when an HA failover occurs, the job might notice that the mtime of a cache 
 file has changed, which in MR2 will cause the job to fail with an exception 
 like the following:
 {noformat}
 java.io.IOException: Resource 
 hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar
  changed on src filesystem (expected 1342137814599, was 1342137814473
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}
 Credit to Sujay Rau for discovering this issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Attachment: HDFS-3864.patch

Here's a patch which addresses the issue. Fortunately, the fix is quite simple 
- just apply the values that we read in from the edit log.

In addition to the automated test provided in the patch, I also tested this 
manually on an HA cluster and confirmed that MR jobs no longer experience the 
distributed cache object changed errors which caused this issue to be 
discovered.
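The idea of the fix can be sketched as below. Class and field names here are hypothetical -- this is not the actual FSEditLogLoader code -- but it shows the step the replay path was missing: copying the times recorded in the OP_CLOSE onto the in-memory inode.

```java
// Minimal sketch of the fix's idea (not the actual FSEditLogLoader code):
// when replaying an OP_CLOSE, apply the mtime/atime recorded in the edit
// op to the in-memory inode instead of discarding them.
public class CloseOpReplaySketch {
  static class Inode { long mtime; long atime; }

  static class CloseOp {
    final long mtime;
    final long atime;
    CloseOp(long mtime, long atime) { this.mtime = mtime; this.atime = atime; }
  }

  // Before the fix the replay path skipped this step, so a file's times
  // could appear to move backwards after an NN restart or HA failover.
  public static void applyCloseOp(CloseOp op, Inode file) {
    file.mtime = op.mtime;
    file.atime = op.atime;
  }
}
```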

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch


 When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
 mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
 NN does not apply these values to the in-memory FS data structure. Because of 
 this, a file's mtime or atime may appear to go back in time after an NN 
 restart, or an HA failover.
 Most of the time this will be harmless and folks won't notice, but in the 
 event one of these files is being used in the distributed cache of an MR job 
 when an HA failover occurs, the job might notice that the mtime of a cache 
 file has changed, which in MR2 will cause the job to fail with an exception 
 like the following:
 {noformat}
 java.io.IOException: Resource 
 hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar
  changed on src filesystem (expected 1342137814599, was 1342137814473
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}
 Credit to Sujay Rau for discovering this issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-3865) TestDistCp is @ignored

2012-08-28 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HDFS-3865:
--

 Summary: TestDistCp is @ignored
 Key: HDFS-3865
 URL: https://issues.apache.org/jira/browse/HDFS-3865
 Project: Hadoop HDFS
  Issue Type: Test
  Components: tools
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Priority: Minor


We should fix TestDistCp so that it actually runs, rather than being ignored.

{code}
@Ignore
public class TestDistCp {
  private static final Log LOG = LogFactory.getLog(TestDistCp.class);
  private static List<Path> pathList = new ArrayList<Path>();
  ...
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3849:
-

   Resolution: Fixed
Fix Version/s: 2.2.0-alpha
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2.

Thanks a lot for the contribution, Colin.

 When re-loading the FSImage, we should clear the existing genStamp and leases.
 --

 Key: HDFS-3849
 URL: https://issues.apache.org/jira/browse/HDFS-3849
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Critical
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, 
 HDFS-3849.003.patch


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 This is an issue in the 2NN, because it sometimes clears the existing FSImage 
 and reloads a new one in order to get back in sync with the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

Attachment: HDFS-3864.patch

Thanks a lot for the quick review, Todd.

Here's an updated patch which lowers the sleep time to 10 milliseconds.

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch, HDFS-3864.patch


 When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
 mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
 NN does not apply these values to the in-memory FS data structure. Because of 
 this, a file's mtime or atime may appear to go back in time after an NN 
 restart, or an HA failover.
 Most of the time this will be harmless and folks won't notice, but in the 
 event one of these files is being used in the distributed cache of an MR job 
 when an HA failover occurs, the job might notice that the mtime of a cache 
 file has changed, which in MR2 will cause the job to fail with an exception 
 like the following:
 {noformat}
 java.io.IOException: Resource 
 hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar
  changed on src filesystem (expected 1342137814599, was 1342137814473
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}
 Credit to Sujay Rau for discovering this issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443598#comment-13443598
 ] 

Hudson commented on HDFS-3849:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2716 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2716/])
HDFS-3849. When re-loading the FSImage, we should clear the existing 
genStamp and leases. Contributed by Colin Patrick McCabe. (Revision 1378364)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378364
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSNamesystem.java


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 --

 Key: HDFS-3849
 URL: https://issues.apache.org/jira/browse/HDFS-3849
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Critical
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, 
 HDFS-3849.003.patch


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 This is an issue in the 2NN, because it sometimes clears the existing FSImage 
 and reloads a new one in order to get back in sync with the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443586#comment-13443586
 ] 

Aaron T. Myers edited comment on HDFS-3864 at 8/29/12 9:21 AM:
---

Here's a patch which addresses the issue. Fortunately, the fix is quite simple 
- just apply the values that we read in from the edit log.

In addition to the automated test provided in the patch, I also tested this 
manually on an HA cluster and confirmed that MR jobs no longer experience the 
distributed cache object changed errors which caused this issue to be 
discovered.

  was (Author: atm):
Here's a patch which addresses the issue. Fortunately, the fix is quite 
simply - just apply the values that we read in from the edit log.

In addition to the automated test provided in the patch, I also tested this 
manually on an HA cluster and confirmed that MR jobs no longer experience the 
:distributed cache object changed errors which caused this issue to be 
discovered.
  
 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch, HDFS-3864.patch


 When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
 mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
 NN does not apply these values to the in-memory FS data structure. Because of 
 this, a file's mtime or atime may appear to go back in time after an NN 
 restart, or an HA failover.
 Most of the time this will be harmless and folks won't notice, but in the 
 event one of these files is being used in the distributed cache of an MR job 
 when an HA failover occurs, the job might notice that the mtime of a cache 
 file has changed, which in MR2 will cause the job to fail with an exception 
 like the following:
 {noformat}
 java.io.IOException: Resource 
 hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar
  changed on src filesystem (expected 1342137814599, was 1342137814473
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}
 Credit to Sujay Rau for discovering this issue.
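The fix discussed in the comment above is essentially to apply the logged values during replay; a minimal standalone sketch of that idea (hypothetical names, not actual NameNode code):

```java
// Sketch: when replaying an OP_CLOSE record, copy the mtime/atime that
// were written to the edit log onto the in-memory inode, rather than
// dropping them. All names here are hypothetical.
class SketchInode {
    long mtime;
    long atime;

    // Apply the times carried by the OP_CLOSE record.
    void replayClose(long loggedMtime, long loggedAtime) {
        this.mtime = loggedMtime;
        this.atime = loggedAtime;
    }
}
```

With this in place the inode's times after replay match what was logged at close, so a distributed-cache consumer no longer sees the mtime move backwards after a restart or failover.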



[jira] [Commented] (HDFS-2264) NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo annotation

2012-08-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443610#comment-13443610
 ] 

Aaron T. Myers commented on HDFS-2264:
--

Hey Jitendra, sorry for forgetting about this JIRA for so long (almost exactly 
a year!)

I just encountered this issue again in a user's cluster. My new thinking is 
that we should just remove the expected client principal from the 
NamenodeProtocol entirely. I think this makes sense since the 2NN, SBN, BN, and 
balancer all potentially use this interface, so there's no single client 
principal that could reasonably be expected. The balancer, in particular, 
should be able to be run from any node, even one not running a daemon at all.

I think to do what I propose here all we have to do is remove the 
clientPrincipal parameter from the SecurityInfo annotation on the 
NamenodeProtocol, and make sure that all of the methods exposed by this 
interface definitely check for super user privileges. I think most of them do, 
but we should ensure that they all do.

How does this sound to you?
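Concretely, the proposal above would look roughly like dropping the clientPrincipal element from the annotation. The sketch below is illustrative only; the exact constant names in the branch may differ:

```java
// Before (sketch): the annotation pins an expected client principal, so only
// callers authenticating as that one principal are accepted.
@KerberosInfo(
    serverPrincipal = DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY,
    clientPrincipal = DFSConfigKeys.DFS_SECONDARY_NAMENODE_USER_NAME_KEY)
public interface NamenodeProtocol { /* ... */ }

// After (proposed): no clientPrincipal, so any authenticated principal may
// connect, and every exposed method must itself check superuser privileges.
@KerberosInfo(
    serverPrincipal = DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY)
public interface NamenodeProtocol { /* ... */ }
```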

 NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo 
 annotation
 ---

 Key: HDFS-2264
 URL: https://issues.apache.org/jira/browse/HDFS-2264
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Aaron T. Myers
Assignee: Harsh J
 Fix For: 0.24.0

 Attachments: HDFS-2264.r1.diff


 The {{@KerberosInfo}} annotation specifies the expected server and client 
 principals for a given protocol in order to look up the correct principal 
 name from the config. The {{NamenodeProtocol}} has the wrong value for the 
 client config key. This wasn't noticed because most setups actually use the 
 same *value* for both the NN and 2NN principals ({{hdfs/_HOST@REALM}}), 
 in which the {{_HOST}} part gets replaced at run-time. This bug therefore 
 only manifests itself on secure setups which explicitly specify the NN and 
 2NN principals.



[jira] [Comment Edited] (HDFS-2264) NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo annotation

2012-08-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443610#comment-13443610
 ] 

Aaron T. Myers edited comment on HDFS-2264 at 8/29/12 9:45 AM:
---

Hey Jitendra, sorry for forgetting about this JIRA for so long (almost exactly 
a year!)

I just encountered this issue again in a user's cluster. My new thinking is 
that we should just remove the expected client principal from the 
NamenodeProtocol entirely. I think this makes sense since the 2NN, SBN, BN, and 
balancer all potentially use this interface, so there's no single client 
principal that could reasonably be expected. The balancer, in particular, 
should be able to be run from any node, even one not running a daemon at all.

I think to do what I propose here all we have to do is remove the 
clientPrincipal parameter from the SecurityInfo annotation on the 
NamenodeProtocol, and make sure that all of the methods exposed by this 
interface definitely check for super user privileges. I think most of them do, 
but we should ensure that they all do.

How does this sound to you?

  was (Author: atm):
Hey Jitendra, sorry for forgetting about this JIRA for so long (almost 
exactly a year!)

I just encountered this issue again in a user's cluster. My new thinking is 
that we should just remove the expected client principal from the 
NamenodeProtocol entirely. I think this makes sense the 2NN, SBN, BN, and 
balancer all potentially use this interface, so there's no single client 
principal that could reasonably be expected. The balancer, in particular, 
should be able to be run from any node, even one not running a daemon at all.

I think to do what I propose here all we have to do is remove the 
clientPrincipal parameter from the SecurityInfo annotation on the 
NamenodeProtocol, and make sure that all of the methods exposed by this 
interface definitely check for super user privileges. I think most of them do, 
but we should ensure that they all do.

How does this sound to you?
  
 NamenodeProtocol has the wrong value for clientPrincipal in KerberosInfo 
 annotation
 ---

 Key: HDFS-2264
 URL: https://issues.apache.org/jira/browse/HDFS-2264
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
Reporter: Aaron T. Myers
Assignee: Harsh J
 Fix For: 0.24.0

 Attachments: HDFS-2264.r1.diff


 The {{@KerberosInfo}} annotation specifies the expected server and client 
 principals for a given protocol in order to look up the correct principal 
 name from the config. The {{NamenodeProtocol}} has the wrong value for the 
 client config key. This wasn't noticed because most setups actually use the 
 same *value* for both the NN and 2NN principals ({{hdfs/_HOST@REALM}}), 
 in which the {{_HOST}} part gets replaced at run-time. This bug therefore 
 only manifests itself on secure setups which explicitly specify the NN and 
 2NN principals.



[jira] [Commented] (HDFS-3849) When re-loading the FSImage, we should clear the existing genStamp and leases.

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443624#comment-13443624
 ] 

Hudson commented on HDFS-3849:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2682 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2682/])
HDFS-3849. When re-loading the FSImage, we should clear the existing 
genStamp and leases. Contributed by Colin Patrick McCabe. (Revision 1378364)

 Result = FAILURE
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378364
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSNamesystem.java


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 --

 Key: HDFS-3849
 URL: https://issues.apache.org/jira/browse/HDFS-3849
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Critical
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3849.001.patch, HDFS-3849.002.patch, 
 HDFS-3849.003.patch


 When re-loading the FSImage, we should clear the existing genStamp and leases.
 This is an issue in the 2NN, because it sometimes clears the existing FSImage 
 and reloads a new one in order to get back in sync with the NN.



[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HDFS-3466:


Attachment: hdfs-3466-b1-2.patch

Here's a patch that incorporates Eli's feedback.

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.
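For reference, the two key constants named above correspond to the following hdfs-site.xml properties (keytab paths are illustrative):

```xml
<!-- Illustrative hdfs-site.xml fragment; paths are examples only. -->
<property>
  <name>dfs.namenode.keytab.file</name>               <!-- DFS_NAMENODE_KEYTAB_FILE_KEY -->
  <value>/etc/security/keytab/nn.service.keytab</value>
</property>
<property>
  <name>dfs.web.authentication.kerberos.keytab</name> <!-- DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY -->
  <value>/etc/security/keytab/spnego.service.keytab</value>
</property>
```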



[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HDFS-3466:


Attachment: hdfs-3466-trunk.patch

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk-2.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.



[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HDFS-3466:


Attachment: (was: hdfs-3466-trunk.patch)

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk-2.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.



[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HDFS-3466:


Attachment: (was: hdfs-3466-trunk.patch)

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk-2.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.



[jira] [Updated] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HDFS-3466:


Attachment: hdfs-3466-trunk-2.patch

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk-2.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.



[jira] [Commented] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443645#comment-13443645
 ] 

Hadoop QA commented on HDFS-3466:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12542858/hdfs-3466-trunk-2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javac.  The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3115//console

This message is automatically generated.

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk-2.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.



[jira] [Commented] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443654#comment-13443654
 ] 

Hadoop QA commented on HDFS-3466:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12542858/hdfs-3466-trunk-2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javac.  The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3116//console

This message is automatically generated.

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk-2.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.



[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443657#comment-13443657
 ] 

Hadoop QA commented on HDFS-3864:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542840/HDFS-3864.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3113//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3113//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3113//console

This message is automatically generated.

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch, HDFS-3864.patch


 When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
 mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
 NN does not apply these values to the in-memory FS data structure. Because of 
 this, a file's mtime or atime may appear to go back in time after an NN 
 restart, or an HA failover.
 Most of the time this will be harmless and folks won't notice, but in the 
 event one of these files is being used in the distributed cache of an MR job 
 when an HA failover occurs, the job might notice that the mtime of a cache 
 file has changed, which in MR2 will cause the job to fail with an exception 
 like the following:
 {noformat}
 java.io.IOException: Resource 
 hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar
  changed on src filesystem (expected 1342137814599, was 1342137814473
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}
 Credit to Sujay Rau for discovering this issue.



[jira] [Updated] (HDFS-3855) Replace hardcoded strings with the already defined config keys in DataNode.java

2012-08-28 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-3855:
-

Description: Replace hardcoded strings with the already defined config keys 
in DataNode.java 

 Replace hardcoded strings with the already defined config keys in 
 DataNode.java 
 

 Key: HDFS-3855
 URL: https://issues.apache.org/jira/browse/HDFS-3855
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node
Affects Versions: 1.2.0
Reporter: Brandon Li
Assignee: Brandon Li
Priority: Trivial
 Attachments: HDFS-3855.branch-1.patch


 Replace hardcoded strings with the already defined config keys in 
 DataNode.java 



[jira] [Commented] (HDFS-3135) Build a war file for HttpFS instead of packaging the server (tomcat) along with the application.

2012-08-28 Thread Ryan Hennig (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443710#comment-13443710
 ] 

Ryan Hennig commented on HDFS-3135:
---

I'm troubleshooting a broken build that fails on the Tomcat download, because 
our Jenkins server doesn't have internet access (by design).  Rather, all 
components are supposed to be fetched from our internal Maven Repository 
(Artifactory).  So while I don't need the war file change, I do think this 
direct download should be removed.

 Build a war file for HttpFS instead of packaging the server (tomcat) along 
 with the application.
 

 Key: HDFS-3135
 URL: https://issues.apache.org/jira/browse/HDFS-3135
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: build
Affects Versions: 0.23.2
Reporter: Ravi Prakash
  Labels: build

 There are several reasons why web applications should not be packaged along 
 with the server that is expected to serve them. For one, not all organisations 
 use vanilla tomcat. There are other reasons I won't go into.
 I'm filing this bug because some of our builds failed in trying to download 
 the tomcat.tar.gz file. We then had to manually wget the file and place it in 
 downloads/ to make the build pass. I suspect the download failed because of 
 an overloaded server (Frankly, I don't really know). If someone has ideas, 
 please share them.



[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443712#comment-13443712
 ] 

Hadoop QA commented on HDFS-3864:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12542846/HDFS-3864.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestHftpDelegationToken
  org.apache.hadoop.hdfs.web.TestWebHDFS
  org.apache.hadoop.hdfs.server.datanode.TestBPOfferService

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3114//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3114//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3114//console

This message is automatically generated.

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch, HDFS-3864.patch


 When logging an OP_CLOSE to the edit log, the NN writes out an updated file 
 mtime and atime. However, when reading in an OP_CLOSE from the edit log, the 
 NN does not apply these values to the in-memory FS data structure. Because of 
 this, a file's mtime or atime may appear to go back in time after an NN 
 restart, or an HA failover.
 Most of the time this will be harmless and folks won't notice, but in the 
 event one of these files is being used in the distributed cache of an MR job 
 when an HA failover occurs, the job might notice that the mtime of a cache 
 file has changed, which in MR2 will cause the job to fail with an exception 
 like the following:
 {noformat}
 java.io.IOException: Resource 
 hdfs://ha-nn-uri/user/jenkins/.staging/job_1341364439849_0513/libjars/snappy-java-1.0.3.2.jar
  changed on src filesystem (expected 1342137814599, was 1342137814473
   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:90)
   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}
 Credit to Sujay Rau for discovering this issue.
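
The description above implies a narrow class of fix: when replaying an OP_CLOSE, copy the logged times onto the in-memory file. A minimal sketch with hypothetical stand-in types (not the real NameNode classes):

```java
// Stand-in types for illustration only; the real fix lives in the NN's
// edit-log loading path.
class InodeFileStub {
    long mtime;
    long atime;
}

class CloseOp {
    final long mtime;
    final long atime;
    CloseOp(long mtime, long atime) { this.mtime = mtime; this.atime = atime; }
}

class EditLogReplayer {
    /** Apply the times recorded in the edit-log op to the in-memory file. */
    static void applyCloseOp(CloseOp op, InodeFileStub file) {
        file.mtime = op.mtime;  // per the bug report, this step was skipped
        file.atime = op.atime;  // so times could appear to move backwards
    }
}
```

With this applied during replay, the in-memory mtime/atime match what was logged at close time, so a restart or failover cannot make them regress.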

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443715#comment-13443715
 ] 

Aaron T. Myers commented on HDFS-3864:
--

The findbugs warning is unrelated and I'm confident that the test failures are 
unrelated as well.

I'm going to commit this patch momentarily.

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Attachments: HDFS-3864.patch, HDFS-3864.patch





[jira] [Updated] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-3864:
-

   Resolution: Fixed
Fix Version/s: 2.2.0-alpha
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've just committed this to trunk and branch-2. Thanks a lot for the review, 
Todd.

 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3864.patch, HDFS-3864.patch





[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443737#comment-13443737
 ] 

Hudson commented on HDFS-3864:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2717 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2717/])
HDFS-3864. NN does not update internal file mtime for OP_CLOSE when reading 
from the edit log. Contributed by Aaron T. Myers. (Revision 1378413)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378413
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestModTime.java


 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3864.patch, HDFS-3864.patch





[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443738#comment-13443738
 ] 

Hudson commented on HDFS-3864:
--

Integrated in Hadoop-Common-trunk-Commit #2654 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2654/])
HDFS-3864. NN does not update internal file mtime for OP_CLOSE when reading 
from the edit log. Contributed by Aaron T. Myers. (Revision 1378413)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378413
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestModTime.java


 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3864.patch, HDFS-3864.patch





[jira] [Commented] (HDFS-3864) NN does not update internal file mtime for OP_CLOSE when reading from the edit log

2012-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443752#comment-13443752
 ] 

Hudson commented on HDFS-3864:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2683 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2683/])
HDFS-3864. NN does not update internal file mtime for OP_CLOSE when reading 
from the edit log. Contributed by Aaron T. Myers. (Revision 1378413)

 Result = FAILURE
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1378413
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestModTime.java


 NN does not update internal file mtime for OP_CLOSE when reading from the 
 edit log
 --

 Key: HDFS-3864
 URL: https://issues.apache.org/jira/browse/HDFS-3864
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha
Reporter: Aaron T. Myers
Assignee: Aaron T. Myers
 Fix For: 2.2.0-alpha

 Attachments: HDFS-3864.patch, HDFS-3864.patch





[jira] [Commented] (HDFS-1490) TransferFSImage should timeout

2012-08-28 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443791#comment-13443791
 ] 

Vinay commented on HDFS-1490:
-

{quote}Why not introduce a new config which defaults to something like 1 
minute?{quote}
OK, agreed. I will introduce a new config for this.
{quote}In the test case, shouldn't you somehow notify the servlet to exit? 
Currently it waits on itself, but nothing notifies it.{quote}
That was just added to make the client call time out. Ideally, that wait will 
be interrupted when the server is stopped. Anyway, I will add a timeout for 
that as well.

Thanks Todd for the comments. I will post a new patch shortly.
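
The "new config defaulting to something like 1 minute" discussed above maps onto the standard `URLConnection` timeout knobs. A sketch under the assumption that the image transfer goes through `URLConnection`; the class name and constant here are illustrative, not the actual patch:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class ImageTransferTimeout {
    // The "defaults to something like 1 minute" suggestion from the thread.
    static final int DEFAULT_TIMEOUT_MS = 60_000;

    /** Open a connection with both connect and read timeouts set. */
    static URLConnection open(URL url, int timeoutMs) throws IOException {
        URLConnection conn = url.openConnection(); // no network I/O yet
        conn.setConnectTimeout(timeoutMs); // fail fast if the peer is down
        conn.setReadTimeout(timeoutMs);    // don't hang forever mid-read
        return conn;
    }
}
```

Setting both timeouts is what prevents the "secondary hangs forever reading the image" failure mode: a crashed primary surfaces as a `SocketTimeoutException` instead of an indefinite blocked read.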

 TransferFSImage should timeout
 --

 Key: HDFS-1490
 URL: https://issues.apache.org/jira/browse/HDFS-1490
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Dmytro Molkov
Assignee: Dmytro Molkov
Priority: Minor
 Attachments: HDFS-1490.patch, HDFS-1490.patch





[jira] [Commented] (HDFS-3466) The SPNEGO filter for the NameNode should come out of the web keytab file

2012-08-28 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443810#comment-13443810
 ] 

Eli Collins commented on HDFS-3466:
---

Hey Owen, I think you meant to remove the 2nd initialization of httpKeytab.
 
{code}
+String httpKeytab = conf.get(
+  DFSConfigKeys.DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY);
+if (httpKeytab == null) {
+  httpKeytab = conf.get(DFSConfigKeys.DFS_NAMENODE_KEYTAB_FILE_KEY);
+}
 String httpKeytab = conf
   .get(DFSConfigKeys.DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY);
{code}
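
For illustration, the consolidated lookup Eli is asking for might look like the following, with a plain `Map` standing in for the Hadoop `Configuration` and the key strings written out literally (assumed values of the `DFSConfigKeys` constants):

```java
import java.util.Map;

class KeytabLookup {
    // Literal key strings are assumed values of the DFSConfigKeys constants.
    static final String WEB_KEYTAB = "dfs.web.authentication.kerberos.keytab";
    static final String NN_KEYTAB  = "dfs.namenode.keytab.file";

    /** Prefer the web keytab; fall back to the namenode keytab. */
    static String httpKeytab(Map<String, String> conf) {
        String keytab = conf.get(WEB_KEYTAB);
        if (keytab == null) {
            keytab = conf.get(NN_KEYTAB);  // single fallback, one declaration
        }
        return keytab;
    }
}
```

The point of the review comment is that the second declaration in the quoted diff would shadow the fallback logic entirely; one declaration with a null-check fallback, as sketched, is the intended shape.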

 The SPNEGO filter for the NameNode should come out of the web keytab file
 -

 Key: HDFS-3466
 URL: https://issues.apache.org/jira/browse/HDFS-3466
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node, security
Affects Versions: 1.1.0, 2.0.0-alpha
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3466-b1-2.patch, hdfs-3466-b1.patch, 
 hdfs-3466-trunk-2.patch


 Currently, the spnego filter uses the DFS_NAMENODE_KEYTAB_FILE_KEY to find 
 the keytab. It should use the DFS_WEB_AUTHENTICATION_KERBEROS_KEYTAB_KEY to 
 do it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3865) TestDistCp is @ignored

2012-08-28 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443812#comment-13443812
 ] 

Eli Collins commented on HDFS-3865:
---

Looks like some of the tests are commented out as well (e.g. 
testUniformSizeDistCp).


 TestDistCp is @ignored
 --

 Key: HDFS-3865
 URL: https://issues.apache.org/jira/browse/HDFS-3865
 Project: Hadoop HDFS
  Issue Type: Test
  Components: tools
Affects Versions: 2.2.0-alpha
Reporter: Colin Patrick McCabe
Priority: Minor

 We should fix TestDistCp so that it actually runs, rather than being ignored.
 {code}
 @Ignore
 public class TestDistCp {
   private static final Log LOG = LogFactory.getLog(TestDistCp.class);
   private static List<Path> pathList = new ArrayList<Path>();
   ...
 {code}



[jira] [Resolved] (HDFS-282) Serialize ipcPort in DatanodeID instead of DatanodeRegistration and DatanodeInfo

2012-08-28 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins resolved HDFS-282.
--

Resolution: Not A Problem

No longer an issue now that the writable methods have been removed.

 Serialize ipcPort in DatanodeID instead of DatanodeRegistration and 
 DatanodeInfo
 

 Key: HDFS-282
 URL: https://issues.apache.org/jira/browse/HDFS-282
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Tsz Wo (Nicholas), SZE

 The field DatanodeID.ipcPort is currently serialized in DatanodeRegistration 
 and DatanodeInfo.  Once HADOOP-2797 (removing the code for handling the old 
 layout) is committed, DatanodeID.ipcPort should be serialized in DatanodeID.



[jira] [Commented] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning

2012-08-28 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443827#comment-13443827
 ] 

Eli Collins commented on HDFS-3837:
---

I investigated some more and confirmed findbugs isn't searching far enough up 
the hierarchy for the common superclass. E.g. if I swap the variables in the 
equals call I get:

{noformat}
org.apache.hadoop.hdfs.protocol.DatanodeInfo.equals(Object) used to determine 
equality
org.apache.hadoop.hdfs.server.common.JspHelper$NodeRecord.equals(Object) used 
to determine equality
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.equals(Object) 
used to determine equality
At DataNode.java:[line 1871]
{noformat}

It stops at DatanodeDescriptor#equals even though this calls super.equals 
(DatanodeInfo) which calls super.equals (DatanodeID). Just like the current 
warning stops at DatanodeRegistration#equals which calls super.equals 
(DatanodeID).

It would be better (and findbugs wouldn't choke) if the various classes that 
extend DatanodeID held a DatanodeID member instead. I looked at this for 
HDFS-3237 and it required a ton of changes that probably aren't worth it.

Given this, I'll update the patch per your suggestion, Suresh, to ignore the 
warning in DataNode#recoverBlock.

 Fix DataNode.recoverBlock findbugs warning
 --

 Key: HDFS-3837
 URL: https://issues.apache.org/jira/browse/HDFS-3837
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Eli Collins
 Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt, 
 hdfs-3837.txt


 HDFS-2686 introduced the following findbugs warning:
 {noformat}
 Call to equals() comparing different types in 
 org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(BlockRecoveryCommand$RecoveringBlock)
 {noformat}
 Both are using DatanodeID#equals but it's a different method because 
 DNR#equals overrides equals for some reason (doesn't change behavior).



[jira] [Updated] (HDFS-3837) Fix DataNode.recoverBlock findbugs warning

2012-08-28 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-3837:
--

Attachment: hdfs-3837.txt

Updated patch attached.

 Fix DataNode.recoverBlock findbugs warning
 --

 Key: HDFS-3837
 URL: https://issues.apache.org/jira/browse/HDFS-3837
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Eli Collins
 Attachments: hdfs-3837.txt, hdfs-3837.txt, hdfs-3837.txt, 
 hdfs-3837.txt


