[jira] [Commented] (HDFS-1952) FSEditLog.open() appears to succeed even if all EDITS directories fail

2011-05-23 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037746#comment-13037746
 ] 

Matt Foley commented on HDFS-1952:
--

+1 on the v22 version.
Confirmed it compiles successfully, and the new unit test fails before the 
FSEditLog change and passes after.

 FSEditLog.open() appears to succeed even if all EDITS directories fail
 --

 Key: HDFS-1952
 URL: https://issues.apache.org/jira/browse/HDFS-1952
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.22.0, 0.23.0
Reporter: Matt Foley
Assignee: Andrew Wang
  Labels: newbie
 Attachments: hdfs-1952-0.22.patch, hdfs-1952.patch, hdfs-1952.patch, 
 hdfs-1952.patch


 FSEditLog.open() appears to succeed even if all of the individual 
 directories failed to allow creation of an EditLogOutputStream.  The problem 
 and solution are essentially similar to those of HDFS-1505.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-1727) fsck command can display command usage if user passes any illegal argument

2011-05-23 Thread Uma Maheswara Rao G (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G updated HDFS-1727:
--

Description: 
In the fsck command, if the user passes arguments like
./hadoop fsck -test -files -blocks -racks
it will take / as the path and display information for the whole DFS: files, 
blocks and racks.

But here we are hiding the user's mistake. Instead, we can display the 
command usage if the user passes any invalid argument like the above.

Likewise, if the user passes an illegal optional argument like
./hadoop fsck /test -listcorruptfileblocks instead of
./hadoop fsck /test -list-corruptfileblocks, we can also display the proper 
command usage.

  was:
In fsck command if user passes the arguments like 
   ./hadoop fsck -test -files -blocks -racks 
   In this case it will take / and will display whole DFS information regarding 
to files,blocks,racks.

But here, we are hiding the user mistake. Instead of this, we can display the 
usage if user passes any invalid argument like above.


   Assignee: Uma Maheswara Rao G
Summary: fsck command can display command usage if user passes any 
illegal argument  (was: fsck command can display usage if user passes any other 
arguments with '-' ( other than -move, -delete, -files , -openforwrite, -blocks 
, -locations, -racks).)

 fsck command can display command usage if user passes any illegal argument
 --

 Key: HDFS-1727
 URL: https://issues.apache.org/jira/browse/HDFS-1727
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
Priority: Minor

 In the fsck command, if the user passes arguments like
 ./hadoop fsck -test -files -blocks -racks
 it will take / as the path and display information for the whole DFS: files, 
 blocks and racks.
 But here we are hiding the user's mistake. Instead, we can display the 
 command usage if the user passes any invalid argument like the above.
 Likewise, if the user passes an illegal optional argument like
 ./hadoop fsck /test -listcorruptfileblocks instead of
 ./hadoop fsck /test -list-corruptfileblocks, we can also display the proper 
 command usage.
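
The proposed validation — reject any '-' argument outside the known flag set 
so the caller can print usage — could look roughly like this. This is a 
hypothetical sketch, not the actual DFSck code, and the flag list is 
abbreviated:

```c
#include <string.h>

/* Known fsck flags (abbreviated, hypothetical list for illustration). */
static const char *known_flags[] = {
    "-move", "-delete", "-files", "-openforwrite",
    "-blocks", "-locations", "-racks", "-list-corruptfileblocks", NULL
};

/* Return 1 if every "-" argument is a recognised fsck flag, 0 otherwise.
 * Arguments not starting with "-" are treated as the path to check.
 * On 0, the caller would print the command usage instead of silently
 * defaulting to "/". */
int fsck_args_valid(int argc, char **argv) {
    for (int i = 0; i < argc; i++) {
        if (argv[i][0] != '-')
            continue;                 /* a path, not a flag */
        int ok = 0;
        for (int j = 0; known_flags[j] != NULL; j++) {
            if (strcmp(argv[i], known_flags[j]) == 0) { ok = 1; break; }
        }
        if (!ok)
            return 0;                 /* unknown flag, e.g. -listcorruptfileblocks */
    }
    return 1;
}
```

With this check, "./hadoop fsck /test -listcorruptfileblocks" would be 
rejected while "./hadoop fsck /test -list-corruptfileblocks" passes.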



[jira] [Created] (HDFS-1981) When namenode goes down while checkpointing and if is started again subsequent Checkpointing is always failing

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)
When namenode goes down while checkpointing and if is started again subsequent 
Checkpointing is always failing
--

 Key: HDFS-1981
 URL: https://issues.apache.org/jira/browse/HDFS-1981
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
 Environment: Linux
Reporter: ramkrishna.s.vasudevan
 Fix For: 0.23.0


This scenario is applicable in both the NN and BNN cases.

When the namenode goes down after creating edits.new, on subsequent restart 
divertFileStreams will not divert to edits.new, because the edits.new file is 
already present and its size is zero.

So on trying to saveCheckpoint, an exception occurs:
2011-05-23 16:38:57,476 WARN org.mortbay.log: /getimage: java.io.IOException: 
GetImage failed. java.io.IOException: Namenode has an edit log with timestamp 
of 2011-05-23 16:38:56 but new checkpoint was created using editlog with 
timestamp 2011-05-23 16:37:30. Checkpoint Aborted.

Is this a bug, or is that the expected behaviour?


[jira] [Commented] (HDFS-1978) All but first option in LIBHDFS_OPTS is ignored

2011-05-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037912#comment-13037912
 ] 

Hudson commented on HDFS-1978:
--

Integrated in Hadoop-Hdfs-trunk #675 (See 
[https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/675/])
HDFS-1978. All but first option in LIBHDFS_OPTS is ignored. Contributed by 
Eli Collins

eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1126312
Files : 
* /hadoop/hdfs/trunk/src/c++/libhdfs/hdfsJniHelper.c
* /hadoop/hdfs/trunk/CHANGES.txt


 All but first option in LIBHDFS_OPTS is ignored
 ---

 Key: HDFS-1978
 URL: https://issues.apache.org/jira/browse/HDFS-1978
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: libhdfs
Affects Versions: 0.21.0
 Environment: RHEL 5.5
 JDK 1.6.0_24
Reporter: Brock Noland
Assignee: Eli Collins
 Fix For: 0.22.0

 Attachments: HDFS-1978.0.patch, hdfs-1978-1.patch


 In getJNIEnv, we go through LIBHDFS_OPTS with strtok and count the number of 
 args, then create an array of options based on that count. But when we 
 actually set up the options we only set the first arg. I believe the fix is 
 pasted inline.
 {noformat}
 Index: src/c++/libhdfs/hdfsJniHelper.c
 ===
 --- src/c++/libhdfs/hdfsJniHelper.c   (revision 1124544)
 +++ src/c++/libhdfs/hdfsJniHelper.c   (working copy)
 @@ -442,6 +442,7 @@
  int argNum = 1;
  for (;argNum < noArgs ; argNum++) {
  options[argNum].optionString = result; //optHadoopArg;
 +result = strtok( NULL, jvmArgDelims);
  }
  }
 {noformat}



[jira] [Created] (HDFS-1982) Null pointer exception is thrown when NN restarts with a block lesser in size than the block that is present in DN1 but the generation stamp is greater in the NN

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)
Null pointer exception is thrown when NN restarts with a block lesser in size 
than the block that is present in DN1 but the generation stamp is greater in 
the NN 
--

 Key: HDFS-1982
 URL: https://issues.apache.org/jira/browse/HDFS-1982
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append
 Environment: Linux
Reporter: ramkrishna.s.vasudevan
 Fix For: 0.20-append


Consider the following scenario. 
We have a cluster with one NN and 2 DNs.

We write some file.

One of the blocks is written on DN1 but not yet completed on DN2's local disk.

Now DN1 gets killed, so pipeline recovery happens for the block with the size 
as on DN2, but the generation stamp gets updated in the NN.

DN2 also gets killed.

Now restart the NN and DN1.
After the restart, the NN's record of the block has the greater generation 
stamp but the smaller size.

This leads to a NullPointerException in the addStoredBlock API.



[jira] [Commented] (HDFS-1981) When namenode goes down while checkpointing and if is started again subsequent Checkpointing is always failing

2011-05-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037949#comment-13037949
 ] 

Todd Lipcon commented on HDFS-1981:
---

Hi Ramkrishna. Can you provide a unit test which shows this issue? It would be 
especially good to see such a test against 0.22, since HDFS-1073 will 
restructure all this code when it's merged into 0.23.

 When namenode goes down while checkpointing and if is started again 
 subsequent Checkpointing is always failing
 --

 Key: HDFS-1981
 URL: https://issues.apache.org/jira/browse/HDFS-1981
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.0
 Environment: Linux
Reporter: ramkrishna.s.vasudevan
 Fix For: 0.23.0


 This scenario is applicable in both the NN and BNN cases.
 When the namenode goes down after creating edits.new, on subsequent restart 
 divertFileStreams will not divert to edits.new, because the edits.new file 
 is already present and its size is zero.
 So on trying to saveCheckpoint, an exception occurs: 
 2011-05-23 16:38:57,476 WARN org.mortbay.log: /getimage: java.io.IOException: 
 GetImage failed. java.io.IOException: Namenode has an edit log with timestamp 
 of 2011-05-23 16:38:56 but new checkpoint was created using editlog with 
 timestamp 2011-05-23 16:37:30. Checkpoint Aborted.
 Is this a bug, or is that the expected behaviour?



[jira] [Updated] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HDFS-1950:
-

Attachment: HDFS-1950-2.patch

 Blocks that are under construction are not getting read if the blocks are 
 more than 10. Only complete blocks are read properly. 
 

 Key: HDFS-1950
 URL: https://issues.apache.org/jira/browse/HDFS-1950
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, name-node
Affects Versions: 0.20-append
Reporter: ramkrishna.s.vasudevan
 Fix For: 0.20-append

 Attachments: HDFS-1950-2.patch


 Before going to the root cause, let's look at the read behaviour for a file 
 with more than 10 blocks in the append case. 
 Logic: 
 === 
 There is a prefetch size (dfs.read.prefetch.size) for the DFSInputStream, 
 with a default value of 10. 
 This prefetch size is the number of blocks that the client will fetch from 
 the namenode at a time when reading a file. 
 For example, assume a file X with 22 blocks resides in HDFS. 
 The reader first fetches the first 10 blocks from the namenode and starts 
 reading. 
 It then fetches the next 10 blocks from the NN and continues reading. 
 Finally it fetches the remaining 2 blocks from the NN and completes the read. 
 Cause: 
 === 
 The scenario that fails: a writer wrote 10+ blocks plus a partial block and 
 called sync. A reader of the file will not get the last partial block. 
 The client first gets 10 block locations from the NN. It then checks whether 
 the file is under construction and, if so, gets the size of the last partial 
 block from the datanode and reads the full file. 
 However, when the number of blocks is more than 10, the last block will not 
 be in the first fetch; it will be in the (num of blocks / 10)th fetch. 
 The problem is that the DFSClient has no logic to get the size of the last 
 partial block for fetches other than the first, so the reader will not be 
 able to read all of the data that was synced. 
 Also, the InputStream.available API uses the first fetched block size to 
 iterate; ideally this size has to be increased.
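
The batching arithmetic in the description is easy to check: with the default 
prefetch of 10, the batch that contains a file's last (possibly partial) block 
is the "(num of blocks / 10)th fetch". A hypothetical helper, for illustration 
only:

```c
/* Zero-based index of the prefetch batch that contains the last block of a
 * file, given the client prefetch size (dfs.read.prefetch.size, default 10).
 * For <= 10 blocks this is batch 0 -- the only case where the 0.20-append
 * client patched in the size of an under-construction last block. */
int last_block_fetch(int num_blocks, int prefetch) {
    return (num_blocks - 1) / prefetch;
}
```

So for the 22-block example, the last block lands in batch 2, where the client 
never fixes up the partial block's size.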



[jira] [Updated] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HDFS-1949:
-

Attachment: hdfs-1949.patch

 Number format Exception is displayed in Namenode UI when the chunk size field 
 is blank or string value.. 
 -

 Key: HDFS-1949
 URL: https://issues.apache.org/jira/browse/HDFS-1949
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append, 0.21.0, 0.23.0
Reporter: ramkrishna.s.vasudevan
Priority: Minor
 Fix For: 0.23.0

 Attachments: HDFS-1949.patch, hdfs-1949.patch


 In the Namenode UI we have a text box to enter the chunk size.
 The expected value for the chunk size is a valid integer.
 If an invalid value (a string or empty spaces) is provided, it throws a 
 NumberFormatException.
 The desired behaviour is to fall back to the default value if no valid value 
 is specified.
 Soln
 ====
 We can catch the NumberFormatException and assign the default value when an 
 invalid value is specified.
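
The proposed handling — fall back to the default when the field is blank or 
non-numeric — can be sketched as follows. This is a hypothetical C helper for 
illustration only; the actual fix would catch NumberFormatException in the 
Java/JSP code:

```c
#include <ctype.h>
#include <stdlib.h>

/* Parse a chunk-size field, returning def when the input is blank,
 * non-numeric, has trailing junk, or is non-positive -- the fallback
 * behaviour the issue proposes instead of surfacing a parse exception. */
long parse_chunk_size(const char *s, long def) {
    if (s == NULL) return def;
    while (isspace((unsigned char)*s)) s++;    /* skip leading blanks */
    if (*s == '\0') return def;                /* blank field */
    char *end;
    long v = strtol(s, &end, 10);
    while (isspace((unsigned char)*end)) end++;
    if (*end != '\0' || v <= 0) return def;    /* junk or non-positive value */
    return v;
}
```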



[jira] [Updated] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HDFS-1950:
-

Status: Patch Available  (was: Open)

 Blocks that are under construction are not getting read if the blocks are 
 more than 10. Only complete blocks are read properly. 
 

 Key: HDFS-1950
 URL: https://issues.apache.org/jira/browse/HDFS-1950
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, name-node
Affects Versions: 0.20-append
Reporter: ramkrishna.s.vasudevan
 Fix For: 0.20-append

 Attachments: HDFS-1950-2.patch


 Before going to the root cause, let's look at the read behaviour for a file 
 with more than 10 blocks in the append case. 
 Logic: 
 === 
 There is a prefetch size (dfs.read.prefetch.size) for the DFSInputStream, 
 with a default value of 10. 
 This prefetch size is the number of blocks that the client will fetch from 
 the namenode at a time when reading a file. 
 For example, assume a file X with 22 blocks resides in HDFS. 
 The reader first fetches the first 10 blocks from the namenode and starts 
 reading. 
 It then fetches the next 10 blocks from the NN and continues reading. 
 Finally it fetches the remaining 2 blocks from the NN and completes the read. 
 Cause: 
 === 
 The scenario that fails: a writer wrote 10+ blocks plus a partial block and 
 called sync. A reader of the file will not get the last partial block. 
 The client first gets 10 block locations from the NN. It then checks whether 
 the file is under construction and, if so, gets the size of the last partial 
 block from the datanode and reads the full file. 
 However, when the number of blocks is more than 10, the last block will not 
 be in the first fetch; it will be in the (num of blocks / 10)th fetch. 
 The problem is that the DFSClient has no logic to get the size of the last 
 partial block for fetches other than the first, so the reader will not be 
 able to read all of the data that was synced. 
 Also, the InputStream.available API uses the first fetched block size to 
 iterate; ideally this size has to be increased.



[jira] [Updated] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HDFS-1949:
-

Status: Patch Available  (was: Open)

 Number format Exception is displayed in Namenode UI when the chunk size field 
 is blank or string value.. 
 -

 Key: HDFS-1949
 URL: https://issues.apache.org/jira/browse/HDFS-1949
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.21.0, 0.20-append, 0.23.0
Reporter: ramkrishna.s.vasudevan
Priority: Minor
 Fix For: 0.23.0

 Attachments: HDFS-1949.patch, hdfs-1949.patch


 In the Namenode UI we have a text box to enter the chunk size.
 The expected value for the chunk size is a valid integer.
 If an invalid value (a string or empty spaces) is provided, it throws a 
 NumberFormatException.
 The desired behaviour is to fall back to the default value if no valid value 
 is specified.
 Soln
 ====
 We can catch the NumberFormatException and assign the default value when an 
 invalid value is specified.



[jira] [Updated] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HDFS-1949:
-

Status: Open  (was: Patch Available)

 Number format Exception is displayed in Namenode UI when the chunk size field 
 is blank or string value.. 
 -

 Key: HDFS-1949
 URL: https://issues.apache.org/jira/browse/HDFS-1949
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.21.0, 0.20-append, 0.23.0
Reporter: ramkrishna.s.vasudevan
Priority: Minor
 Fix For: 0.23.0

 Attachments: HDFS-1949.patch, hdfs-1949.patch


 In the Namenode UI we have a text box to enter the chunk size.
 The expected value for the chunk size is a valid integer.
 If an invalid value (a string or empty spaces) is provided, it throws a 
 NumberFormatException.
 The desired behaviour is to fall back to the default value if no valid value 
 is specified.
 Soln
 ====
 We can catch the NumberFormatException and assign the default value when an 
 invalid value is specified.



[jira] [Commented] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.

2011-05-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037981#comment-13037981
 ] 

Hadoop QA commented on HDFS-1950:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12480115/HDFS-1950-2.patch
  against trunk revision 1126312.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/612//console

This message is automatically generated.

 Blocks that are under construction are not getting read if the blocks are 
 more than 10. Only complete blocks are read properly. 
 

 Key: HDFS-1950
 URL: https://issues.apache.org/jira/browse/HDFS-1950
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, name-node
Affects Versions: 0.20-append
Reporter: ramkrishna.s.vasudevan
 Fix For: 0.20-append

 Attachments: HDFS-1950-2.patch


 Before going to the root cause, let's look at the read behaviour for a file 
 with more than 10 blocks in the append case. 
 Logic: 
 === 
 There is a prefetch size (dfs.read.prefetch.size) for the DFSInputStream, 
 with a default value of 10. 
 This prefetch size is the number of blocks that the client will fetch from 
 the namenode at a time when reading a file. 
 For example, assume a file X with 22 blocks resides in HDFS. 
 The reader first fetches the first 10 blocks from the namenode and starts 
 reading. 
 It then fetches the next 10 blocks from the NN and continues reading. 
 Finally it fetches the remaining 2 blocks from the NN and completes the read. 
 Cause: 
 === 
 The scenario that fails: a writer wrote 10+ blocks plus a partial block and 
 called sync. A reader of the file will not get the last partial block. 
 The client first gets 10 block locations from the NN. It then checks whether 
 the file is under construction and, if so, gets the size of the last partial 
 block from the datanode and reads the full file. 
 However, when the number of blocks is more than 10, the last block will not 
 be in the first fetch; it will be in the (num of blocks / 10)th fetch. 
 The problem is that the DFSClient has no logic to get the size of the last 
 partial block for fetches other than the first, so the reader will not be 
 able to read all of the data that was synced. 
 Also, the InputStream.available API uses the first fetched block size to 
 iterate; ideally this size has to be increased.



[jira] [Commented] (HDFS-1950) Blocks that are under construction are not getting read if the blocks are more than 10. Only complete blocks are read properly.

2011-05-23 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037985#comment-13037985
 ] 

ramkrishna.s.vasudevan commented on HDFS-1950:
--

This patch applies to the 0.20-append branch.

 Blocks that are under construction are not getting read if the blocks are 
 more than 10. Only complete blocks are read properly. 
 

 Key: HDFS-1950
 URL: https://issues.apache.org/jira/browse/HDFS-1950
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client, name-node
Affects Versions: 0.20-append
Reporter: ramkrishna.s.vasudevan
 Fix For: 0.20-append

 Attachments: HDFS-1950-2.patch


 Before going to the root cause, let's look at the read behaviour for a file 
 with more than 10 blocks in the append case. 
 Logic: 
 === 
 There is a prefetch size (dfs.read.prefetch.size) for the DFSInputStream, 
 with a default value of 10. 
 This prefetch size is the number of blocks that the client will fetch from 
 the namenode at a time when reading a file. 
 For example, assume a file X with 22 blocks resides in HDFS. 
 The reader first fetches the first 10 blocks from the namenode and starts 
 reading. 
 It then fetches the next 10 blocks from the NN and continues reading. 
 Finally it fetches the remaining 2 blocks from the NN and completes the read. 
 Cause: 
 === 
 The scenario that fails: a writer wrote 10+ blocks plus a partial block and 
 called sync. A reader of the file will not get the last partial block. 
 The client first gets 10 block locations from the NN. It then checks whether 
 the file is under construction and, if so, gets the size of the last partial 
 block from the datanode and reads the full file. 
 However, when the number of blocks is more than 10, the last block will not 
 be in the first fetch; it will be in the (num of blocks / 10)th fetch. 
 The problem is that the DFSClient has no logic to get the size of the last 
 partial block for fetches other than the first, so the reader will not be 
 able to read all of the data that was synced. 
 Also, the InputStream.available API uses the first fetched block size to 
 iterate; ideally this size has to be increased.



[jira] [Commented] (HDFS-1787) Not enough xcievers error should propagate to client

2011-05-23 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037996#comment-13037996
 ] 

Jonathan Hsieh commented on HDFS-1787:
--

Actually, I assumed that the test run focused on the patch in the 'changes' 
section of the Jenkins result for build 608, which actually ran the newly 
added test case from the HDFS-1787 patch.

The 
org.apache.hadoop.hdfs.server.datanode.TestFiDataTransferProtocol2.pipeline_Fi_30
 test seems to be failing intermittently. It also isn't reported by Hudson. 
Is there a reason why?

 Not enough xcievers error should propagate to client
 --

 Key: HDFS-1787
 URL: https://issues.apache.org/jira/browse/HDFS-1787
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node
Affects Versions: 0.23.0
Reporter: Todd Lipcon
Assignee: Jonathan Hsieh
  Labels: newbie
 Fix For: 0.23.0

 Attachments: hdfs-1787.2.patch, hdfs-1787.3.patch, hdfs-1787.3.patch, 
 hdfs-1787.patch


 We find that users often run into the default transceiver limits in the DN. 
 Putting aside the inherent issues with xceiver threads, it would be nice if 
 the "xceiver limit exceeded" error propagated to the client. Currently, 
 clients simply see an EOFException, which is hard to interpret, and have to 
 go slogging through DN logs to find the underlying issue.
 The data transfer protocol should be extended to either have a special error 
 code for "not enough xceivers" or have a generic error code to which a 
 string can be attached and propagated.



[jira] [Commented] (HDFS-1949) Number format Exception is displayed in Namenode UI when the chunk size field is blank or string value..

2011-05-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038005#comment-13038005
 ] 

Hadoop QA commented on HDFS-1949:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12480116/hdfs-1949.patch
  against trunk revision 1126312.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

-1 release audit.  The applied patch generated 1 release audit warnings 
(more than the trunk's current 0 warnings).

-1 core tests.  The patch failed these core unit tests:
  org.apache.hadoop.hdfs.TestDFSStorageStateRecovery

+1 contrib tests.  The patch passed contrib unit tests.

+1 system test framework.  The patch passed system test framework compile.

Test results: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//testReport/
Release audit warnings: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/611//console

This message is automatically generated.

 Number format Exception is displayed in Namenode UI when the chunk size field 
 is blank or string value.. 
 -

 Key: HDFS-1949
 URL: https://issues.apache.org/jira/browse/HDFS-1949
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append, 0.21.0, 0.23.0
Reporter: ramkrishna.s.vasudevan
Priority: Minor
 Fix For: 0.23.0

 Attachments: HDFS-1949.patch, hdfs-1949.patch


 In the Namenode UI we have a text box to enter the chunk size.
 The expected value for the chunk size is a valid integer.
 If an invalid value (a string or empty spaces) is provided, it throws a 
 NumberFormatException.
 The desired behaviour is to fall back to the default value if no valid value 
 is specified.
 Soln
 ====
 We can catch the NumberFormatException and assign the default value when an 
 invalid value is specified.



[jira] [Commented] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections

2011-05-23 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038055#comment-13038055
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1965:
--

A question came up: by setting maxidletime to 0, is there a race condition 
where the timeout occurs before the first call, i.e. the proxy is closed 
before the first call is made?

 IPCs done using block token-based tickets can't reuse connections
 -

 Key: HDFS-1965
 URL: https://issues.apache.org/jira/browse/HDFS-1965
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.22.0

 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, 
 hdfs-1965.txt


 This is the reason that TestFileConcurrentReaders has been failing a lot. 
 Reproducing a comment from HDFS-1057:
 The test has a thread which continually re-opens the file which is being 
 written to. Since the file's in the middle of being written, it makes an RPC 
 to the DataNode in order to determine the visible length of the file. This 
 RPC is authenticated using the block token which came back in the 
 LocatedBlocks object as the security ticket.
 When this RPC hits the IPC layer, it looks at its existing connections and 
 sees none that can be re-used, since the block token differs between the two 
 requesters. Hence, it reconnects, and we end up with hundreds or thousands of 
 IPC connections to the datanode.
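
The failure to reuse comes down to the connection-cache lookup keying on the 
security ticket as well as the remote address. A minimal sketch of that 
comparison (hypothetical, not the actual Hadoop IPC code):

```c
#include <string.h>

/* A cached IPC connection is reusable only if both the remote address and
 * the authentication ticket match. Two requests to the same DataNode that
 * carry different block tokens therefore get two separate connections --
 * which is how the test ends up with hundreds of connections. */
struct conn_key {
    const char *addr;    /* remote address, e.g. "dn1:50010" */
    const char *ticket;  /* serialized block token */
};

int conn_reusable(const struct conn_key *cached, const struct conn_key *want) {
    return strcmp(cached->addr, want->addr) == 0
        && strcmp(cached->ticket, want->ticket) == 0;
}
```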



[jira] [Resolved] (HDFS-1828) TestBlocksWithNotEnoughRacks intermittently fails assert

2011-05-23 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE resolved HDFS-1828.
--

Resolution: Fixed
  Assignee: Matt Foley

Since we are not reverting the patch, I am re-closing this.  If the test is 
still failing, please create a new issue.

 TestBlocksWithNotEnoughRacks intermittently fails assert
 

 Key: HDFS-1828
 URL: https://issues.apache.org/jira/browse/HDFS-1828
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: 0.23.0
Reporter: Matt Foley
Assignee: Matt Foley
 Fix For: 0.23.0

 Attachments: TestBlocksWithNotEnoughRacks.java.patch, 
 TestBlocksWithNotEnoughRacks_v2.patch


 In 
 server.namenode.TestBlocksWithNotEnoughRacks.testSufficientlyReplicatedBlocksWithNotEnoughRacks
  
 assert fails at curReplicas == REPLICATION_FACTOR, but it seems that it 
 should go higher initially, and if the test doesn't wait for it to go back 
 down, it will fail with a false positive.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections

2011-05-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038072#comment-13038072
 ] 

Todd Lipcon commented on HDFS-1965:
---

I think in trunk, it's not possible, since the connection is only lazily opened 
by the actual RPC to the DataNode. Then, it won't close since there's a call 
outstanding.

In 0.22, it's possible that it will open one connection for the 
getProtocolVersion() call and a second one for the actual RPC. Unless I'm 
missing something, that should only be an efficiency issue and not a 
correctness issue. Do you agree?

 IPCs done using block token-based tickets can't reuse connections
 -

 Key: HDFS-1965
 URL: https://issues.apache.org/jira/browse/HDFS-1965
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.22.0

 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, 
 hdfs-1965.txt


 This is the reason that TestFileConcurrentReaders has been failing a lot. 
 Reproducing a comment from HDFS-1057:
 The test has a thread which continually re-opens the file which is being 
 written to. Since the file's in the middle of being written, it makes an RPC 
 to the DataNode in order to determine the visible length of the file. This 
 RPC is authenticated using the block token which came back in the 
 LocatedBlocks object as the security ticket.
 When this RPC hits the IPC layer, it looks at its existing connections and 
 sees none that can be re-used, since the block token differs between the two 
 requesters. Hence, it reconnects, and we end up with hundreds or thousands of 
 IPC connections to the datanode.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections

2011-05-23 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038076#comment-13038076
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1965:
--

Okay, I'm fine with it since it is only a temporary fix.

+1 the 0.22 patch looks good.

 IPCs done using block token-based tickets can't reuse connections
 -

 Key: HDFS-1965
 URL: https://issues.apache.org/jira/browse/HDFS-1965
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.22.0

 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, 
 hdfs-1965.txt


 This is the reason that TestFileConcurrentReaders has been failing a lot. 
 Reproducing a comment from HDFS-1057:
 The test has a thread which continually re-opens the file which is being 
 written to. Since the file's in the middle of being written, it makes an RPC 
 to the DataNode in order to determine the visible length of the file. This 
 RPC is authenticated using the block token which came back in the 
 LocatedBlocks object as the security ticket.
 When this RPC hits the IPC layer, it looks at its existing connections and 
 sees none that can be re-used, since the block token differs between the two 
 requesters. Hence, it reconnects, and we end up with hundreds or thousands of 
 IPC connections to the datanode.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-1853) refactor TestNodeCount to import standard node counting and wait for replication methods

2011-05-23 Thread Matt Foley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated HDFS-1853:
-

Issue Type: Improvement  (was: Sub-task)
Parent: (was: HDFS-1852)

 refactor TestNodeCount to import standard node counting and wait for 
 replication methods
 --

 Key: HDFS-1853
 URL: https://issues.apache.org/jira/browse/HDFS-1853
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 0.22.0
Reporter: Matt Foley

 Eli's suggestions for refactoring the three wait for loops in TestNodeCount 
 for re-usability (similar to what was done for HDFS-1562): You could augment 
 NameNodeAdapter#getReplicaInfo to return excess and live replica counts as 
 well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil 
 and have TestNodeCount call them. This way we could re-use them in the other 
 replication tests.
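A rough sketch of what such a shared wait helper might look like is below. The names and timeouts are illustrative only; the real DFSTestUtil and NameNodeAdapter signatures may differ.

```java
import java.util.function.IntSupplier;

// Illustrative polling helper in the spirit of the proposed
// DFSTestUtil#waitFor[Live|Excess]Replicas methods: poll a replica count
// supplier until it reaches the expected value or a deadline passes.
public class ReplicaWait {
    static boolean waitForReplicas(IntSupplier currentCount, int expected,
                                   long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (currentCount.getAsInt() == expected) return true;
            Thread.sleep(pollMs);
        }
        // One last check at the deadline before reporting failure.
        return currentCount.getAsInt() == expected;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate replication converging from 1 to 3 replicas over time.
        final int[] replicas = {1};
        Thread repl = new Thread(() -> {
            try {
                Thread.sleep(50); replicas[0] = 2;
                Thread.sleep(50); replicas[0] = 3;
            } catch (InterruptedException ignored) {}
        });
        repl.start();
        boolean ok = waitForReplicas(() -> replicas[0], 3, 5000, 10);
        repl.join();
        System.out.println("converged: " + ok);
    }
}
```

Centralizing the loop this way makes the timeout and poll interval uniform across the replication tests instead of each test rolling its own.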

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1853) refactor TestNodeCount to import standard node counting and wait for replication methods

2011-05-23 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038098#comment-13038098
 ] 

Matt Foley commented on HDFS-1853:
--

Removed from the HDFS-1852 umbrella task, since it is not related to the 
recurring Hudson test failures.

 refactor TestNodeCount to import standard node counting and wait for 
 replication methods
 --

 Key: HDFS-1853
 URL: https://issues.apache.org/jira/browse/HDFS-1853
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 0.22.0
Reporter: Matt Foley

 Eli's suggestions for refactoring the three wait for loops in TestNodeCount 
 for re-usability (similar to what was done for HDFS-1562): You could augment 
 NameNodeAdapter#getReplicaInfo to return excess and live replica counts as 
 well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil 
 and have TestNodeCount call them. This way we could re-use them in the other 
 replication tests.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HDFS-1853) refactor TestNodeCount to import standard node counting and wait for replication methods

2011-05-23 Thread Matt Foley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley reassigned HDFS-1853:


Assignee: Matt Foley

 refactor TestNodeCount to import standard node counting and wait for 
 replication methods
 --

 Key: HDFS-1853
 URL: https://issues.apache.org/jira/browse/HDFS-1853
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 0.22.0
Reporter: Matt Foley
Assignee: Matt Foley

 Eli's suggestions for refactoring the three wait for loops in TestNodeCount 
 for re-usability (similar to what was done for HDFS-1562): You could augment 
 NameNodeAdapter#getReplicaInfo to return excess and live replica counts as 
 well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil 
 and have TestNodeCount call them. This way we could re-use them in the other 
 replication tests.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1401) TestFileConcurrentReader test case is still timing out / failing

2011-05-23 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038112#comment-13038112
 ] 

Matt Foley commented on HDFS-1401:
--

As of build 611 (May 23), we see:

TestFileConcurrentReader.testUnfinishedBlockCRCErrorTransferToVerySmallWrite 
failed almost every build through 604, but has passed the last five builds in 
which auto-test ran.  This may be fixed, but still needs to be watched for 
intermittent failure.

TestFileConcurrentReader.testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite
 and
TestFileConcurrentReader.testUnfinishedBlockCRCErrorNormalTransfer failed 
intermittently through build 600, but have not failed since.  However, they are 
infrequently intermittent, often skipping six or eight builds between 
failures.  They remain on the watch list.

 TestFileConcurrentReader test case is still timing out / failing
 

 Key: HDFS-1401
 URL: https://issues.apache.org/jira/browse/HDFS-1401
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs client
Affects Versions: 0.22.0
Reporter: Tanping Wang
Priority: Critical
 Attachments: HDFS-1401.patch


 The unit test case, TestFileConcurrentReader after its most recent fix in 
 HDFS-1310 still times out when using java 1.6.0_07.  When using java 
 1.6.0_07, the test case simply hangs.  On apache Hudson build ( which 
 possibly is using a higher sub-version of java) this test case has presented 
 an inconsistent test result that it sometimes passes, some times fails. For 
 example, between the most recent build 423, 424 and build 425, there is no 
 effective change, however, the test case failed on build 424 and passed on 
 build 425
 build 424 test failed
 https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/424/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/
 build 425 test passed
 https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/425/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1401) TestFileConcurrentReader test case is still timing out / failing

2011-05-23 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038119#comment-13038119
 ] 

sam rash commented on HDFS-1401:


see todd's find in:

https://issues.apache.org/jira/browse/HDFS-1057


 TestFileConcurrentReader test case is still timing out / failing
 

 Key: HDFS-1401
 URL: https://issues.apache.org/jira/browse/HDFS-1401
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs client
Affects Versions: 0.22.0
Reporter: Tanping Wang
Priority: Critical
 Attachments: HDFS-1401.patch


 The unit test case, TestFileConcurrentReader after its most recent fix in 
 HDFS-1310 still times out when using java 1.6.0_07.  When using java 
 1.6.0_07, the test case simply hangs.  On apache Hudson build ( which 
 possibly is using a higher sub-version of java) this test case has presented 
 an inconsistent test result that it sometimes passes, some times fails. For 
 example, between the most recent build 423, 424 and build 425, there is no 
 effective change, however, the test case failed on build 424 and passed on 
 build 425
 build 424 test failed
 https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/424/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/
 build 425 test passed
 https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/425/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1852) Umbrella task: Clean up HDFS unit test recurring failures

2011-05-23 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038135#comment-13038135
 ] 

Matt Foley commented on HDFS-1852:
--

Besides the three remaining open issues above, we also have three 
infrequent-intermittent issues that may still exist.  All were last seen in 
build 594, so it is possible they were addressed.

org.apache.hadoop.cli.TestHDFSCLI.testAll - v intermittent, last 566, 579, 587, 
594
org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery.testErrorReplicas - v 
intermittent, last 559, 566, 579, 594
org.apache.hadoop.hdfs.server.datanode.TestBlockReplacement.testBlockReplacement
 - v intermittent, last 565, 578, 594

Recording here for watchlist purposes.

 Umbrella task: Clean up HDFS unit test recurring failures 
 --

 Key: HDFS-1852
 URL: https://issues.apache.org/jira/browse/HDFS-1852
 Project: Hadoop HDFS
  Issue Type: Test
  Components: test
Affects Versions: 0.22.0
Reporter: Matt Foley

 Recurring failures and false positives undermine CI by encouraging developers 
 to ignore unit test failures.  Let's clean these up!
 Some are intermittent due to timing-sensitive conditions.  The unit tests for 
 background thread activities (such as block replication and corrupt replica 
 detection) often use wait while or wait until loops to detect results.  
 The quality and robustness of these loops vary widely, and common usages 
 should be moved to DFSTestUtil.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-236) Random read benchmark for DFS

2011-05-23 Thread Dave Thompson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Thompson updated HDFS-236:
---

Attachment: RndRead-TestDFSIO.patch

I've taken Raghu's patch from 6/27/09 with the random read TestDFSIO 
enhancement, and ported it to the latest (now mapreduce) trunk, 5/4/11 svn rev 
1099590.  Patch attached as RndRead-TestDFSIO.patch.

enjoy,
Dave

 Random read benchmark for DFS
 -

 Key: HDFS-236
 URL: https://issues.apache.org/jira/browse/HDFS-236
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Raghu Angadi
Assignee: Raghu Angadi
 Attachments: HDFS-236.patch, RndRead-TestDFSIO.patch


 We should have at least one random read benchmark that can be run regularly 
 with the rest of the Hadoop benchmarks.
 Please provide benchmark  ideas or requirements.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-1983) Fix path display for copy & rm

2011-05-23 Thread Daryn Sharp (JIRA)
Fix path display for copy & rm
--

 Key: HDFS-1983
 URL: https://issues.apache.org/jira/browse/HDFS-1983
 Project: Hadoop HDFS
  Issue Type: Test
  Components: test
Affects Versions: 0.23.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp


This will also fix a few misc broken tests.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-988) saveNamespace can corrupt edits log

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-988:
-

Status: Open  (was: Patch Available)

removing patch available status since this still needs to be finished up.

 saveNamespace can corrupt edits log
 ---

 Key: HDFS-988
 URL: https://issues.apache.org/jira/browse/HDFS-988
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.21.0, 0.20-append, 0.22.0
Reporter: dhruba borthakur
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.20-append, 0.22.0

 Attachments: HDFS-988_fix_synchs.patch, hdfs-988-2.patch, 
 hdfs-988.txt, saveNamespace.txt, saveNamespace_20-append.patch


 The administrator puts the namenode in safemode and then issues the 
 savenamespace command. This can corrupt the edits log. The problem is that 
 when the NN enters safemode, there could still be pending logSyncs occurring 
 from other threads. Now, the saveNamespace command, when executed, would save 
 an edits log with partial writes. I have seen this happen on 0.20.
 https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12828853page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12828853

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-1965) IPCs done using block token-based tickets can't reuse connections

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1965:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed the 22 patch. Thanks, Nicholas. HADOOP-7317 tracks the real 
underlying issue.

 IPCs done using block token-based tickets can't reuse connections
 -

 Key: HDFS-1965
 URL: https://issues.apache.org/jira/browse/HDFS-1965
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.22.0

 Attachments: hdfs-1965-0.22.txt, hdfs-1965.txt, hdfs-1965.txt, 
 hdfs-1965.txt


 This is the reason that TestFileConcurrentReaders has been failing a lot. 
 Reproducing a comment from HDFS-1057:
 The test has a thread which continually re-opens the file which is being 
 written to. Since the file's in the middle of being written, it makes an RPC 
 to the DataNode in order to determine the visible length of the file. This 
 RPC is authenticated using the block token which came back in the 
 LocatedBlocks object as the security ticket.
 When this RPC hits the IPC layer, it looks at its existing connections and 
 sees none that can be re-used, since the block token differs between the two 
 requesters. Hence, it reconnects, and we end up with hundreds or thousands of 
 IPC connections to the datanode.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-988) saveNamespace can corrupt edits log, apparently due to race conditions

2011-05-23 Thread Matt Foley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated HDFS-988:


Summary: saveNamespace can corrupt edits log, apparently due to race 
conditions  (was: saveNamespace can corrupt edits log)

 saveNamespace can corrupt edits log, apparently due to race conditions
 --

 Key: HDFS-988
 URL: https://issues.apache.org/jira/browse/HDFS-988
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: dhruba borthakur
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.20-append, 0.22.0

 Attachments: HDFS-988_fix_synchs.patch, hdfs-988-2.patch, 
 hdfs-988.txt, saveNamespace.txt, saveNamespace_20-append.patch


 The administrator puts the namenode in safemode and then issues the 
 savenamespace command. This can corrupt the edits log. The problem is that 
 when the NN enters safemode, there could still be pending logSyncs occurring 
 from other threads. Now, the saveNamespace command, when executed, would save 
 an edits log with partial writes. I have seen this happen on 0.20.
 https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12828853page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12828853

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1603) Namenode gets sticky if one of namenode storage volumes disappears (removed, unmounted, etc.)

2011-05-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038245#comment-13038245
 ] 

Todd Lipcon commented on HDFS-1603:
---

ATM and I just brainstormed about this a little bit over some iced coffee. 
Though on the surface it doesn't look too hard to implement timeouts on namedir 
operations, it would actually have to be done in a lot of places (eg 
mkdirs/move calls on storage directories, writing edits, saving images, etc). 
Timing out some of these things isn't entirely straightforward, since the 
underlying calls aren't interruptible.

At some point we could attempt to tackle it, but it looks like a complicated 
project. So, rather than trying to implement this in software for now, it's 
probably better to just recommend the proper NFS mount options when storing 
name dirs on NFS.
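For reference, a hedged example of the kind of mount options meant here. The exact timeo and retrans values are illustrative only and should be tuned per deployment; the server path and mount point are hypothetical.

```shell
# Illustrative NFS mount for a NameNode name directory: a soft,
# interruptible mount with bounded timeouts, so an unreachable NFS server
# returns an I/O error instead of hanging the NameNode indefinitely.
# timeo is in deciseconds; both values here are examples only.
mount -t nfs -o soft,intr,timeo=30,retrans=3 nfs-server:/export/namedir /mnt/namedir
```

With a hard mount (the default), a vanished volume blocks the calling thread until the server returns, which is exactly the stickiness observed in this issue.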

 Namenode gets sticky if one of namenode storage volumes disappears (removed, 
 unmounted, etc.)
 -

 Key: HDFS-1603
 URL: https://issues.apache.org/jira/browse/HDFS-1603
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.21.0
Reporter: Konstantin Boudnik

 While investigating failures on HDFS-1602 it became apparent that once a 
 namenode storage volume is pulled out, the NN becomes completely sticky until 
 {{FSImage:processIOError: removing storage}} moves the storage from the active 
 set. During this time none of the normal NN operations are possible (e.g. 
 creating a directory on HDFS eventually times out).
 In the case of NFS this can be worked around with soft,intr,timeo,retrans 
 settings. However, better handling of the situation is apparently possible 
 and needs to be implemented.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-1967) TestHDFSTrash failing on trunk and 22

2011-05-23 Thread Matt Foley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated HDFS-1967:
-

Issue Type: Sub-task  (was: Bug)
Parent: HDFS-1852

 TestHDFSTrash failing on trunk and 22
 -

 Key: HDFS-1967
 URL: https://issues.apache.org/jira/browse/HDFS-1967
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: 0.22.0
Reporter: Todd Lipcon
 Fix For: 0.22.0


 Seems to have started failing recently in many commit builds as well as the 
 last two nightly builds of 22:
 https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/51/testReport/org.apache.hadoop.hdfs/TestHDFSTrash/testTrashEmptier/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1967) TestHDFSTrash failing on trunk and 22

2011-05-23 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038276#comment-13038276
 ] 

Matt Foley commented on HDFS-1967:
--

TestHDFSTrash.testTrashEmptier() was failing on almost every Hudson build 
through 605.  However, it has not failed for the last four auto-test builds.  
Watch-listing for trunk.

However, we'd like to understand what fixed it (if it is fixed) so we can apply 
the patch to v22 and yahoo-merge branches.

 TestHDFSTrash failing on trunk and 22
 -

 Key: HDFS-1967
 URL: https://issues.apache.org/jira/browse/HDFS-1967
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: 0.22.0
Reporter: Todd Lipcon
 Fix For: 0.22.0


 Seems to have started failing recently in many commit builds as well as the 
 last two nightly builds of 22:
 https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/51/testReport/org.apache.hadoop.hdfs/TestHDFSTrash/testTrashEmptier/

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-1984) HDFS-1073: Enable multiple checkpointers to run simultaneously

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1984:
--

  Component/s: name-node
  Description: 
One of the motivations of HDFS-1073 is that it decouples the checkpoint process 
so that multiple checkpoints could be taken at the same time and not interfere 
with each other.

Currently on the 1073 branch this doesn't quite work right, since we have some 
state and validation in FSImage that's tied to a single fsimage_N -- thus if 
two 2NNs perform a checkpoint at different transaction IDs, only one will 
succeed.

As a stress test, we can run two 2NNs each configured with the 
fs.checkpoint.interval set to 0 which causes them to continuously checkpoint 
as fast as they can.
Affects Version/s: Edit log branch (HDFS-1073)
Fix Version/s: Edit log branch (HDFS-1073)

 HDFS-1073: Enable multiple checkpointers to run simultaneously
 --

 Key: HDFS-1984
 URL: https://issues.apache.org/jira/browse/HDFS-1984
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)


 One of the motivations of HDFS-1073 is that it decouples the checkpoint 
 process so that multiple checkpoints could be taken at the same time and not 
 interfere with each other.
 Currently on the 1073 branch this doesn't quite work right, since we have 
 some state and validation in FSImage that's tied to a single fsimage_N -- 
 thus if two 2NNs perform a checkpoint at different transaction IDs, only one 
 will succeed.
 As a stress test, we can run two 2NNs each configured with the 
 fs.checkpoint.interval set to 0 which causes them to continuously 
 checkpoint as fast as they can.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-1984) HDFS-1073: Enable multiple checkpointers to run simultaneously

2011-05-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038288#comment-13038288
 ] 

Todd Lipcon commented on HDFS-1984:
---

Currently this test scenario fails after a few seconds with an exception like:

11/05/23 15:25:46 WARN mortbay.log: /getimage: java.io.IOException: GetImage 
failed. java.io.IOException: Namenode has an edit log corresponding to txid 
1240 but new checkpoint was created using editlog ending at txid 1238. 
Checkpoint Aborted.
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.validateCheckpointUpload(FSImage.java:894)
at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:107)
at 
org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:80)

but this validation is bogus. So long as no two checkpointers try to upload a 
checkpoint at the same txid, it's OK if they upload old fsimages.

To fix this, I think we need to do the following:

a) Repurpose the checkpointTxId field of FSImage. This currently tracks the 
last txid at which the NN has either saved or uploaded a checkpoint. We use it 
to advertise which image file a checkpointer should download, but we also use 
it to validate the checkpoint upload. Instead, it should be renamed to 
mostRecentImageTxId and only be used to advertise the image.

b) Remove the imageDigest field. The function of validation is now being done 
by an adjacent .md5 file next to each image. When the checkpointer downloads 
an image, the image transfer servlet can just read the .md5 file and include 
the hash as an HTTP header. The checkpointer can then verify that it 
transferred correctly by comparing the image it downloaded against that md5 
hash. When uploading the new checkpoint back to the NN, the same process is 
used in reverse.

The new validation rules for accepting a checkpoint upload should be:
- the namespace/clusterid/etc match up (same as today)
- the transaction ID of the uploaded image is less than the current transaction 
ID of the namespace (sanity check)
- the hash of the file received matches the hash that the 2NN communicates in 
a header
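Those rules could be sketched roughly as follows. The class, method, and parameter names are hypothetical, not the actual FSImage code; the sketch only shows that an upload at an older txid than the most recent image is now accepted, unlike the current validateCheckpointUpload.

```java
import java.io.IOException;
import java.util.Arrays;

// Hypothetical sketch of the proposed upload validation: accept any
// checkpoint whose txid is strictly behind the namespace txid and whose
// MD5 matches the hash the checkpointer sent in an HTTP header.
public class CheckpointUploadCheck {
    static void validateUpload(String localClusterId, String uploadClusterId,
                               long namespaceTxId, long uploadTxId,
                               byte[] receivedMd5, byte[] headerMd5)
            throws IOException {
        if (!localClusterId.equals(uploadClusterId)) {
            throw new IOException("Cluster ID mismatch: " + uploadClusterId);
        }
        if (uploadTxId >= namespaceTxId) {
            throw new IOException("Checkpoint txid " + uploadTxId
                + " is not behind namespace txid " + namespaceTxId);
        }
        if (!Arrays.equals(receivedMd5, headerMd5)) {
            throw new IOException("MD5 of received image does not match "
                + "hash advertised by the checkpointer");
        }
        // Note: no comparison against the most recently saved image txid,
        // so two checkpointers uploading at different txids both succeed.
    }

    public static void main(String[] args) throws IOException {
        byte[] md5 = {1, 2, 3};
        // The scenario from the exception above: namespace at txid 1240,
        // checkpoint built from edits ending at txid 1238 -- now accepted.
        validateUpload("cid-1", "cid-1", 1240, 1238, md5, md5);
        System.out.println("old-but-valid checkpoint accepted");
    }
}
```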


 HDFS-1073: Enable multiple checkpointers to run simultaneously
 --

 Key: HDFS-1984
 URL: https://issues.apache.org/jira/browse/HDFS-1984
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)


 One of the motivations of HDFS-1073 is that it decouples the checkpoint 
 process so that multiple checkpoints could be taken at the same time and not 
 interfere with each other.
 Currently on the 1073 branch this doesn't quite work right, since we have 
 some state and validation in FSImage that's tied to a single fsimage_N -- 
 thus if two 2NNs perform a checkpoint at different transaction IDs, only one 
 will succeed.
 As a stress test, we can run two 2NNs each configured with the 
 fs.checkpoint.interval set to 0 which causes them to continuously 
 checkpoint as fast as they can.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-1963) HDFS rpm integration project

2011-05-23 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated HDFS-1963:


Attachment: HDFS-1963-3.patch

Change configuration directory from $PREFIX/conf to $PREFIX/etc/hadoop per 
Owen's recommendation.  For RPM/deb, it will use /etc/hadoop as the default, 
and create a symlink from $PREFIX/etc/hadoop pointing to /etc/hadoop.

 HDFS rpm integration project
 

 Key: HDFS-1963
 URL: https://issues.apache.org/jira/browse/HDFS-1963
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: build
 Environment: Java 6, RHEL 5.5
Reporter: Eric Yang
Assignee: Eric Yang
 Attachments: HDFS-1963-1.patch, HDFS-1963-2.patch, HDFS-1963-3.patch, 
 HDFS-1963.patch


 This jira is corresponding to HADOOP-6255 and associated directory layout 
 change.  The patch for creating HDFS rpm packaging should be posted here for 
 patch test build to verify against hdfs svn trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet

2011-05-23 Thread Todd Lipcon (JIRA)
HDFS-1073: Cleanup in image transfer servlet


 Key: HDFS-1985
 URL: https://issues.apache.org/jira/browse/HDFS-1985
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Todd Lipcon
Assignee: Todd Lipcon




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1985:
--

  Component/s: name-node
  Description: 
The TransferFsImage class has grown several heads and is somewhat confusing to 
follow. This JIRA is to refactor it a little bit.
- the TransferFsImage class contains static methods to put/get image and edits 
files. It's used by checkpointing nodes.  [the same static methods it has today]
- some common code from call sites of TransferFsImage are moved into 
TransferFsImage itself, so it presents a cleaner interface to checkpointers
- the non-static parts of TransferFsImage are moved to an inner class of 
GetImageServlet called GetImageParams, since they were only responsible for 
parameter parsing/validation.
Affects Version/s: Edit log branch (HDFS-1073)
Fix Version/s: Edit log branch (HDFS-1073)

 HDFS-1073: Cleanup in image transfer servlet
 

 Key: HDFS-1985
 URL: https://issues.apache.org/jira/browse/HDFS-1985
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)


 The TransferFsImage class has grown several heads and is somewhat confusing 
 to follow. This JIRA is to refactor it a little bit.
 - the TransferFsImage class contains static methods to put/get image and 
 edits files. It's used by checkpointing nodes.  [the same static methods it 
 has today]
 - some common code from call sites of TransferFsImage are moved into 
 TransferFsImage itself, so it presents a cleaner interface to checkpointers
 - the non-static parts of TransferFsImage are moved to an inner class of 
 GetImageServlet called GetImageParams, since they were only responsible for 
 parameter parsing/validation.



[jira] [Updated] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1985:
--

Attachment: hdfs-1985.txt

Attached patch does the above refactoring and cleanup.

While I was at it, I also made the client code check the HTTP response for 200 
OK status. This fixes the client error reporting behavior in the event that 
the server throws an exception while processing the request.
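The check described can be sketched as follows; `verifyResponse` is a hypothetical helper name, not necessarily what the patch calls it:

```java
import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical helper illustrating the idea in the patch: fail fast when the
// image-transfer servlet answers anything other than 200 OK, so a server-side
// exception surfaces as a clear client error instead of a silently bad download.
public class TransferCheck {
    static void verifyResponse(int responseCode, String url) throws IOException {
        if (responseCode != HttpURLConnection.HTTP_OK) {
            throw new IOException("Image transfer from " + url
                + " failed with HTTP status " + responseCode);
        }
    }

    public static void main(String[] args) {
        try {
            verifyResponse(500, "http://nn:50070/getimage");
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```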

 HDFS-1073: Cleanup in image transfer servlet
 

 Key: HDFS-1985
 URL: https://issues.apache.org/jira/browse/HDFS-1985
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)

 Attachments: hdfs-1985.txt


 The TransferFsImage class has grown several heads and is somewhat confusing 
 to follow. This JIRA is to refactor it a little bit.
 - the TransferFsImage class contains static methods to put/get image and 
 edits files. It's used by checkpointing nodes.  [the same static methods it 
 has today]
 - some common code from call sites of TransferFsImage are moved into 
 TransferFsImage itself, so it presents a cleaner interface to checkpointers
 - the non-static parts of TransferFsImage are moved to an inner class of 
 GetImageServlet called GetImageParams, since they were only responsible for 
 parameter parsing/validation.



[jira] [Created] (HDFS-1986) Add an option for user to return http or https ports regardless of security is on/off in DFSUtil.getInfoServer()

2011-05-23 Thread Tanping Wang (JIRA)
Add an option for user to return http or https ports regardless of security is 
on/off in DFSUtil.getInfoServer()


 Key: HDFS-1986
 URL: https://issues.apache.org/jira/browse/HDFS-1986
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 0.23.0
Reporter: Tanping Wang
Assignee: Tanping Wang
Priority: Minor
 Fix For: 0.23.0


Currently DFSUtil.getInfoServer returns the http port with security off and the 
https port with security on.  However, for the Cluster UI we want the http port 
regardless of whether security is on or off.  Add a third boolean parameter so 
the caller can decide whether to check security.
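A minimal sketch of the proposed shape; the method and configuration key names here are illustrative stand-ins, not the exact contents of the patch:

```java
// Illustrative sketch only: the real DFSUtil.getInfoServer resolves a
// host:port from the configuration. This just shows how a third boolean
// lets callers bypass the security check and always get the http address.
public class InfoServerSketch {
    static String infoServerKey(boolean securityOn, boolean checkSecurity) {
        if (checkSecurity && securityOn) {
            return "dfs.https.address";   // https address when security matters
        }
        return "dfs.http.address";        // http address otherwise
    }

    public static void main(String[] args) {
        // Cluster UI path: ignore security, always take http.
        System.out.println(infoServerKey(true, false));
        // Existing behavior: security on yields the https address.
        System.out.println(infoServerKey(true, true));
    }
}
```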



[jira] [Resolved] (HDFS-422) fuse-dfs leaks FileSystem handles as it never disconnects them because the FileSystem.Cache does not do reference counting

2011-05-23 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins resolved HDFS-422.
--

Resolution: Duplicate

 fuse-dfs leaks FileSystem handles as it never disconnects them because the 
 FileSystem.Cache does not do reference counting
 --

 Key: HDFS-422
 URL: https://issues.apache.org/jira/browse/HDFS-422
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: contrib/fuse-dfs
Reporter: Pete Wyckoff
Priority: Minor

 Since users may be doing multiple file operations at the same time, a single 
 task in fuse can never call close() on a filesystem (i.e. 
 libhdfs::hdfsDisconnect), because there may be another thread for the same 
 user.
 As such, either fuse-dfs needs to do reference counting, or FileSystem.Cache 
 does, or perhaps a mode could be added where the Cache can be turned off.
 I am currently not seeing any problems in production, but I am still running 
 the 0.18 version, which keeps only one connection as root.



[jira] [Updated] (HDFS-1986) Add an option for user to return http or https ports regardless of security is on/off in DFSUtil.getInfoServer()

2011-05-23 Thread Tanping Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanping Wang updated HDFS-1986:
---

Attachment: HDFS-1986.patch

 Add an option for user to return http or https ports regardless of security 
 is on/off in DFSUtil.getInfoServer()
 

 Key: HDFS-1986
 URL: https://issues.apache.org/jira/browse/HDFS-1986
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 0.23.0
Reporter: Tanping Wang
Assignee: Tanping Wang
Priority: Minor
 Fix For: 0.23.0

 Attachments: HDFS-1986.patch


 Currently DFSUtil.getInfoServer returns the http port with security off and 
 the https port with security on.  However, for the Cluster UI we want the 
 http port regardless of whether security is on or off.  Add a third boolean 
 parameter so the caller can decide whether to check security.



[jira] [Commented] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet

2011-05-23 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038311#comment-13038311
 ] 

Eli Collins commented on HDFS-1985:
---

+1 pending Hudson.  This is much nicer.

Nit: indent the throws in parseLongParam and downloadImageToStorage

 HDFS-1073: Cleanup in image transfer servlet
 

 Key: HDFS-1985
 URL: https://issues.apache.org/jira/browse/HDFS-1985
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)

 Attachments: hdfs-1985.txt


 The TransferFsImage class has grown several heads and is somewhat confusing 
 to follow. This JIRA is to refactor it a little bit.
 - the TransferFsImage class contains static methods to put/get image and 
 edits files. It's used by checkpointing nodes.  [the same static methods it 
 has today]
 - some common code from call sites of TransferFsImage are moved into 
 TransferFsImage itself, so it presents a cleaner interface to checkpointers
 - the non-static parts of TransferFsImage are moved to an inner class of 
 GetImageServlet called GetImageParams, since they were only responsible for 
 parameter parsing/validation.



[jira] [Commented] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet

2011-05-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038341#comment-13038341
 ] 

Todd Lipcon commented on HDFS-1985:
---

Will fix indentation nits on commit. This is on the 1073 branch, so Hudson 
won't run against it. I ran TestCheckpoint, which covers this code pretty well, 
as well as a subset of tests (all those modified in the last 4 commits on the 
branch). They all passed (except for the BN-related ones, which are known to be 
broken at the moment).

 HDFS-1073: Cleanup in image transfer servlet
 

 Key: HDFS-1985
 URL: https://issues.apache.org/jira/browse/HDFS-1985
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)

 Attachments: hdfs-1985.txt


 The TransferFsImage class has grown several heads and is somewhat confusing 
 to follow. This JIRA is to refactor it a little bit.
 - the TransferFsImage class contains static methods to put/get image and 
 edits files. It's used by checkpointing nodes.  [the same static methods it 
 has today]
 - some common code from call sites of TransferFsImage are moved into 
 TransferFsImage itself, so it presents a cleaner interface to checkpointers
 - the non-static parts of TransferFsImage are moved to an inner class of 
 GetImageServlet called GetImageParams, since they were only responsible for 
 parameter parsing/validation.



[jira] [Resolved] (HDFS-1985) HDFS-1073: Cleanup in image transfer servlet

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1985.
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]

 HDFS-1073: Cleanup in image transfer servlet
 

 Key: HDFS-1985
 URL: https://issues.apache.org/jira/browse/HDFS-1985
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)

 Attachments: hdfs-1985.txt


 The TransferFsImage class has grown several heads and is somewhat confusing 
 to follow. This JIRA is to refactor it a little bit.
 - the TransferFsImage class contains static methods to put/get image and 
 edits files. It's used by checkpointing nodes.  [the same static methods it 
 has today]
 - some common code from call sites of TransferFsImage are moved into 
 TransferFsImage itself, so it presents a cleaner interface to checkpointers
 - the non-static parts of TransferFsImage are moved to an inner class of 
 GetImageServlet called GetImageParams, since they were only responsible for 
 parameter parsing/validation.



[jira] [Commented] (HDFS-1969) Running rollback on new-version namenode destroys namespace

2011-05-23 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038345#comment-13038345
 ] 

Konstantin Shvachko commented on HDFS-1969:
---

Todd, what I meant is that {{NNStorage.setFields()}} should not contain the if 
statement
{code}
if (layoutVersion <= -26) {
...
}
{code}
I actually don't see where it is triggered.
In general, we can allow these ifs in the loading part of the code, like 
loadFSImage(). But the saving part should be free of dependencies on the layout 
version, because there is only one LV - the current one.

The precondition sounds good. But it would be better to just convert it to an 
assert. I don't think we've used {{com.google.*}} before, at least not in HDFS. 
Why introduce it now?

 Running rollback on new-version namenode destroys namespace
 ---

 Key: HDFS-1969
 URL: https://issues.apache.org/jira/browse/HDFS-1969
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.22.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.22.0

 Attachments: hdfs-1969.txt, hdfs-1969.txt


 The following sequence leaves the namespace in an inconsistent/broken state:
 - format NN using 0.20 (or any prior release, probably)
 - run hdfs namenode -upgrade on 0.22. ^C the NN once it comes up.
 - run hdfs namenode -rollback on 0.22  (this should fail but doesn't!)
 This leaves the name directory in a state such that the version file claims 
 it's an 0.20 namespace, but the fsimage is in 0.22 format. It then crashes 
 when trying to start up.



[jira] [Created] (HDFS-1987) HDFS-1073: Test for 2NN downloading image is not running

2011-05-23 Thread Todd Lipcon (JIRA)
HDFS-1073: Test for 2NN downloading image is not running


 Key: HDFS-1987
 URL: https://issues.apache.org/jira/browse/HDFS-1987
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Todd Lipcon
Assignee: Todd Lipcon






[jira] [Updated] (HDFS-1987) HDFS-1073: Test for 2NN downloading image is not running

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1987:
--

  Component/s: name-node
  Description: TestCheckpoint.testSecondaryImageDownload was introduced 
at some point but was never called from anywhere, so it wasn't actually 
running. This JIRA is to fix it up to work on trunk and actually run as part of 
the test suite.
Affects Version/s: Edit log branch (HDFS-1073)
Fix Version/s: Edit log branch (HDFS-1073)

 HDFS-1073: Test for 2NN downloading image is not running
 

 Key: HDFS-1987
 URL: https://issues.apache.org/jira/browse/HDFS-1987
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)


 TestCheckpoint.testSecondaryImageDownload was introduced at some point but 
 was never called from anywhere, so it wasn't actually running. This JIRA is 
 to fix it up to work on trunk and actually run as part of the test suite.



[jira] [Commented] (HDFS-1969) Running rollback on new-version namenode destroys namespace

2011-05-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038351#comment-13038351
 ] 

Todd Lipcon commented on HDFS-1969:
---

bq. I actually don't see where it is triggered.

The test code uses this code in order to create VERSION files that look like 
they came from older versions. We could copy-paste some new code in to generate 
VERSION files, but that's a bit messy too.

bq. The precondition sounds good. But it would be better to just convert it to 
an assert. I don't think we've used com.google.* before, at least not in HDFS. 
Why introduce it now?

There was a vote on the mailing list a few months back and people said they 
were OK with including Guava (com.google.*). The advantage of Preconditions 
over assert is that Preconditions will always run regardless of JVM options. In 
areas that aren't performance-sensitive, this is preferred.
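The difference is easy to demonstrate with a stand-in for Guava's Preconditions.checkState (the real class lives in com.google.common.base); unlike {{assert}}, the explicit check is evaluated even when the JVM runs without {{-ea}}:

```java
// Minimal stand-in for Guava's Preconditions.checkState: evaluated
// unconditionally, unlike the assert statement.
public class ChecksDemo {
    static void checkState(boolean expression, String message) {
        if (!expression) {
            throw new IllegalStateException(message);
        }
    }

    public static void main(String[] args) {
        boolean assertFired = false;
        try {
            assert 1 == 2 : "enforced only when run with -ea";
        } catch (AssertionError e) {
            assertFired = true;
        }

        boolean checkFired = false;
        try {
            checkState(1 == 2, "layout version mismatch");
        } catch (IllegalStateException e) {
            checkFired = true;
        }

        // checkFired is always true; assertFired is true only under -ea.
        System.out.println("assert fired: " + assertFired);
        System.out.println("checkState fired: " + checkFired);
    }
}
```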

 Running rollback on new-version namenode destroys namespace
 ---

 Key: HDFS-1969
 URL: https://issues.apache.org/jira/browse/HDFS-1969
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.22.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.22.0

 Attachments: hdfs-1969.txt, hdfs-1969.txt


 The following sequence leaves the namespace in an inconsistent/broken state:
 - format NN using 0.20 (or any prior release, probably)
 - run hdfs namenode -upgrade on 0.22. ^C the NN once it comes up.
 - run hdfs namenode -rollback on 0.22  (this should fail but doesn't!)
 This leaves the name directory in a state such that the version file claims 
 it's an 0.20 namespace, but the fsimage is in 0.22 format. It then crashes 
 when trying to start up.



[jira] [Updated] (HDFS-1987) HDFS-1073: Test for 2NN downloading image is not running

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1987:
--

Attachment: hdfs-1987.txt

Attached patch makes this test case actually run, and fixes it up to work with 
the new edits log layout.

 HDFS-1073: Test for 2NN downloading image is not running
 

 Key: HDFS-1987
 URL: https://issues.apache.org/jira/browse/HDFS-1987
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)

 Attachments: hdfs-1987.txt


 TestCheckpoint.testSecondaryImageDownload was introduced at some point but 
 was never called from anywhere, so it wasn't actually running. This JIRA is 
 to fix it up to work on trunk and actually run as part of the test suite.



[jira] [Updated] (HDFS-1984) HDFS-1073: Enable multiple checkpointers to run simultaneously

2011-05-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1984:
--

Attachment: hdfs-1984.txt

Here's a patch that does the above, and also adds two new test cases:

1) simulates a corrupt byte while transferring the image, making sure it 
correctly detects it and rejects the upload
2) runs two 2NNs interleaved using Mockito to be sure that they don't interfere 
with each other

I also ran the test from the command line as described above. I was able to run 
two 2NNs both checkpointing as fast as they could. There was one minor 
unrelated race condition that I'll address as a followup.
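The corruption check exercised by the first test hinges on comparing digests; a self-contained sketch of the idea (not the actual test code, which uses the cluster's image-transfer machinery) using MD5:

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Sketch of the acceptance check a receiver can apply: recompute the digest
// of the received bytes and compare it to the digest the sender advertised,
// rejecting the upload on mismatch.
public class ImageDigestCheck {
    static boolean accept(byte[] received, byte[] expectedDigest) throws Exception {
        byte[] actual = MessageDigest.getInstance("MD5").digest(received);
        return Arrays.equals(actual, expectedDigest);
    }

    public static void main(String[] args) throws Exception {
        byte[] image = "fsimage contents".getBytes("UTF-8");
        byte[] digest = MessageDigest.getInstance("MD5").digest(image);

        System.out.println("clean copy accepted:   " + accept(image, digest));

        byte[] corrupted = image.clone();
        corrupted[3] ^= 0x01;  // flip a single bit, as the test simulates
        System.out.println("corrupt copy accepted: " + accept(corrupted, digest));
    }
}
```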

 HDFS-1073: Enable multiple checkpointers to run simultaneously
 --

 Key: HDFS-1984
 URL: https://issues.apache.org/jira/browse/HDFS-1984
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: name-node
Affects Versions: Edit log branch (HDFS-1073)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: Edit log branch (HDFS-1073)

 Attachments: hdfs-1984.txt


 One of the motivations of HDFS-1073 is that it decouples the checkpoint 
 process so that multiple checkpoints could be taken at the same time and not 
 interfere with each other.
 Currently on the 1073 branch this doesn't quite work right, since we have 
 some state and validation in FSImage that's tied to a single fsimage_N -- 
 thus if two 2NNs perform a checkpoint at different transaction IDs, only one 
 will succeed.
 As a stress test, we can run two 2NNs, each configured with 
 fs.checkpoint.interval set to 0, which causes them to continuously 
 checkpoint as fast as they can.
