[jira] Commented: (HDFS-1267) fuse-dfs does not compile

2010-06-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884183#action_12884183
 ] 

Hudson commented on HDFS-1267:
--

Integrated in Hadoop-Hdfs-trunk-Commit #324 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/324/])
HDFS-1267. fuse-dfs does not compile. Contributed by Devaraj Das


> fuse-dfs does not compile
> -
>
> Key: HDFS-1267
> URL: https://issues.apache.org/jira/browse/HDFS-1267
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: contrib/fuse-dfs
>Reporter: Tom White
>Assignee: Devaraj Das
>Priority: Critical
> Fix For: 0.21.0
>
> Attachments: 1267-1.patch
>
>
> Looks like since libhdfs was updated to use the new UGI (HDFS-1000), fuse-dfs 
> no longer compiles:
> {noformat}
>  [exec] fuse_connect.c: In function 'doConnectAsUser':
>  [exec] fuse_connect.c:40: error: too many arguments to function 
> 'hdfsConnectAsUser'
> {noformat}
> Any takers to fix this please?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1199) Extract a subset of tests for smoke (DOA) validation.

2010-06-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884178#action_12884178
 ] 

Hadoop QA commented on HDFS-1199:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448467/HDFS-1199.patch
  against trunk revision 959324.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 51 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 103 release audit warnings 
(more than the trunk's current 102 warnings).

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/417/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/417/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/417/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/417/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/417/console

This message is automatically generated.

> Extract a subset of tests for smoke (DOA) validation.
> -
>
> Key: HDFS-1199
> URL: https://issues.apache.org/jira/browse/HDFS-1199
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.21.0
>Reporter: Konstantin Boudnik
>Assignee: Konstantin Boudnik
> Attachments: HDFS-1199.patch, HDFS-1199.patch
>
>
> Similar to that of HADOOP-6810 for HDFS.
> Adds the ability to run up to 30 minutes of tests to 'smoke' the HDFS build 
> (i.e., find possible issues faster than the full test cycle does).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1267) fuse-dfs does not compile

2010-06-30 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated HDFS-1267:


Status: Resolved  (was: Patch Available)
  Assignee: Devaraj Das
Resolution: Fixed

I've just committed this.

> fuse-dfs does not compile
> -
>
> Key: HDFS-1267
> URL: https://issues.apache.org/jira/browse/HDFS-1267
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: contrib/fuse-dfs
>Reporter: Tom White
>Assignee: Devaraj Das
>Priority: Critical
> Fix For: 0.21.0
>
> Attachments: 1267-1.patch
>
>
> Looks like since libhdfs was updated to use the new UGI (HDFS-1000), fuse-dfs 
> no longer compiles:
> {noformat}
>  [exec] fuse_connect.c: In function 'doConnectAsUser':
>  [exec] fuse_connect.c:40: error: too many arguments to function 
> 'hdfsConnectAsUser'
> {noformat}
> Any takers to fix this please?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1271) Decommissioning nodes not persisted between NameNode restarts

2010-06-30 Thread Andrew Ryan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884171#action_12884171
 ] 

Andrew Ryan commented on HDFS-1271:
---

Operational note: We decommission nodes a lot (like, one or more per day). 
Whenever we detect a failed component on a node that is going to necessitate 
someone taking it down for repairs, we decommission as a preventative measure, 
because our SiteOps team could do the repair at any time after that. 
Occasionally I have seen nodes take hours or days or more to decommission, if 
they are unlucky recipients of a block from a file from some very long-running 
job.

IIRC it's not just NN restarts that are a problem; DN restarts have the same 
problem, because when the DN comes up it will be denied communication by the NN 
and immediately shut down, even if it hasn't finished decommissioning. Then you 
have to un-exclude and re-exclude to get the blocks flowing again. But at least 
there, with manual intervention, you can get your blocks back.

> Decommissioning nodes not persisted between NameNode restarts
> -
>
> Key: HDFS-1271
> URL: https://issues.apache.org/jira/browse/HDFS-1271
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Travis Crawford
>
> Datanodes in the process of being decommissioned should still be 
> decommissioning after namenode restarts. Currently they are marked as dead 
> after a restart.
> Details:
> Nodes can be safely removed from a cluster by marking them as decommissioned 
> and waiting for their data to be replicated elsewhere. This is accomplished 
> by adding a node to the file referenced by dfs.hosts.excluded, then 
> refreshing nodes.
> Decommissioning means block reports from the decommissioned datanode are no 
> longer accepted by the namenode, meaning that for decommissioning to occur 
> the NN must have an existing block report. That is, a datanode can transition 
> from: live --> decommissioning --> dead. Nodes can NOT transition from: 
> dead --> decommissioning --> dead.
> Operationally this is problematic because intervention is required should the 
> NN restart while nodes are decommissioning, meaning in-house administration 
> tools must be more complex, or, more likely, admins have to babysit the 
> decommissioning process.
> Someone more familiar with the code might have a better idea, but perhaps the 
> first block report for dfs.hosts.excluded hosts should be accepted?
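
The workflow quoted above (edit the exclude file, then refresh nodes) can also be 
driven programmatically. Below is a minimal sketch, assuming the exclude file 
referenced by the usual dfs.hosts.exclude key has already been updated; it simply 
asks the NN to re-read its include/exclude lists, which is what "hadoop dfsadmin 
-refreshNodes" does. The class name is made up for illustration.

{code}
// Hedged sketch: trigger the NameNode host-list refresh from Java, equivalent
// to running "hadoop dfsadmin -refreshNodes" after editing the exclude file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;
import org.apache.hadoop.util.ToolRunner;

public class RefreshNodesSketch {
  public static void main(String[] args) throws Exception {
    int rc = ToolRunner.run(new Configuration(), new DFSAdmin(),
        new String[] { "-refreshNodes" });
    System.exit(rc);
  }
}
{code}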

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1212) Harmonize HDFS JAR library versions with Common

2010-06-30 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated HDFS-1212:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

I've just committed this. (Test failures were unrelated.)

> Harmonize HDFS JAR library versions with Common
> ---
>
> Key: HDFS-1212
> URL: https://issues.apache.org/jira/browse/HDFS-1212
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Reporter: Tom White
>Assignee: Tom White
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: HDFS-1212.patch, HDFS-1212.patch, HDFS-1212.patch, 
> HDFS-1212.patch
>
>
> HDFS part of HADOOP-6800.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1199) Extract a subset of tests for smoke (DOA) validation.

2010-06-30 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-1199:
-

Attachment: HDFS-1199.patch

Renamed a target to make it more consistent.

> Extract a subset of tests for smoke (DOA) validation.
> -
>
> Key: HDFS-1199
> URL: https://issues.apache.org/jira/browse/HDFS-1199
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.21.0
>Reporter: Konstantin Boudnik
>Assignee: Konstantin Boudnik
> Attachments: HDFS-1199.patch, HDFS-1199.patch
>
>
> Similar to that of HADOOP-6810 for HDFS.
> Adds the ability to run up to 30 minutes of tests to 'smoke' the HDFS build 
> (i.e., find possible issues faster than the full test cycle does).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1199) Extract a subset of tests for smoke (DOA) validation.

2010-06-30 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-1199:
-

Status: Patch Available  (was: Open)

> Extract a subset of tests for smoke (DOA) validation.
> -
>
> Key: HDFS-1199
> URL: https://issues.apache.org/jira/browse/HDFS-1199
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.21.0
>Reporter: Konstantin Boudnik
>Assignee: Konstantin Boudnik
> Attachments: HDFS-1199.patch, HDFS-1199.patch
>
>
> Similar to that of HADOOP-6810 for HDFS.
> Adds the ability to run up to 30 minutes of tests to 'smoke' the HDFS build 
> (i.e., find possible issues faster than the full test cycle does).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1004) Update NN to support Kerberized SSL from HADOOP-6584

2010-06-30 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884068#action_12884068
 ] 

Kan Zhang commented on HDFS-1004:
-

Patch h6584-03.patch has been manually tested together with c6584-02.patch of 
HADOOP-6584. I also ran "ant test" and it passed.

> Update NN to support Kerberized SSL from HADOOP-6584
> 
>
> Key: HDFS-1004
> URL: https://issues.apache.org/jira/browse/HDFS-1004
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Reporter: Jakob Homan
>Assignee: Jakob Homan
> Attachments: h6584-03.patch, HDFS-1004.patch
>
>
> The namenode needs to be tweaked to use the new Kerberized SSL connector.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1277) [Herriot] New property for multi user list.

2010-06-30 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884029#action_12884029
 ] 

Konstantin Boudnik commented on HDFS-1277:
--

+1, the patch looks good (subject to the usual validation process).

> [Herriot] New property for multi user list.
> ---
>
> Key: HDFS-1277
> URL: https://issues.apache.org/jira/browse/HDFS-1277
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: test
>Reporter: Vinay Kumar Thota
>Assignee: Vinay Kumar Thota
> Attachments: HDFS-1277.patch
>
>
> Adding new property for multi user list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1277) [Herriot] New property for multi user list.

2010-06-30 Thread Vinay Kumar Thota (JIRA)
[Herriot] New property for multi user list.
---

 Key: HDFS-1277
 URL: https://issues.apache.org/jira/browse/HDFS-1277
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Reporter: Vinay Kumar Thota
Assignee: Vinay Kumar Thota


Adding new property for multi user list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1277) [Herriot] New property for multi user list.

2010-06-30 Thread Vinay Kumar Thota (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay Kumar Thota updated HDFS-1277:


Attachment: HDFS-1277.patch

Patch for trunk.

> [Herriot] New property for multi user list.
> ---
>
> Key: HDFS-1277
> URL: https://issues.apache.org/jira/browse/HDFS-1277
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: test
>Reporter: Vinay Kumar Thota
>Assignee: Vinay Kumar Thota
> Attachments: HDFS-1277.patch
>
>
> Adding new property for multi user list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1258) Clearing namespace quota on "/" corrupts FS image

2010-06-30 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan updated HDFS-1258:
--

Hadoop Flags: [Reviewed]

+1.  Looks good.

> Clearing namespace quota on "/" corrupts FS image
> -
>
> Key: HDFS-1258
> URL: https://issues.apache.org/jira/browse/HDFS-1258
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
>Priority: Blocker
> Fix For: 0.20.3, 0.21.0, 0.22.0
>
> Attachments: clear-quota.patch, clear-quota.patch
>
>
> The HDFS root directory starts out with a default namespace quota of 
> Integer.MAX_VALUE. If you clear this quota (using "hadoop dfsadmin -clrQuota 
> /"), the fsimage gets corrupted immediately. Subsequent 2NN rolls will fail, 
> and the NN will not come back up from a restart.
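
As a rough reproduction sketch of the report above (not the committed regression 
test): the 0.20-era MiniDFSCluster constructor and the FSConstants.QUOTA_RESET / 
QUOTA_DONT_SET constants are assumptions about the test APIs of that line, and 
the class name is made up.

{code}
// Hedged reproduction sketch for the "clear namespace quota on /" failure mode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.protocol.FSConstants;

public class ClearRootQuotaRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null); // format and start
    try {
      DistributedFileSystem dfs = (DistributedFileSystem) cluster.getFileSystem();
      // Equivalent of "hadoop dfsadmin -clrQuota /": reset the namespace quota on
      // the root while leaving the diskspace quota untouched.
      dfs.setQuota(new Path("/"), FSConstants.QUOTA_RESET, FSConstants.QUOTA_DONT_SET);
    } finally {
      cluster.shutdown();
    }
    // On affected versions, restarting on the existing image reproduces the
    // "NN will not come back up from a restart" part of the report.
    MiniDFSCluster restarted = new MiniDFSCluster(conf, 1, false, null);
    restarted.shutdown();
  }
}
{code}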

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1258) Clearing namespace quota on "/" corrupts FS image

2010-06-30 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883979#action_12883979
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1258:
--

The test report web site is not available at the moment.  Will check it later.

Aaron, could you also provide patches for 0.20 and 0.21?

> Clearing namespace quota on "/" corrupts FS image
> -
>
> Key: HDFS-1258
> URL: https://issues.apache.org/jira/browse/HDFS-1258
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
>Priority: Blocker
> Fix For: 0.20.3, 0.21.0, 0.22.0
>
> Attachments: clear-quota.patch, clear-quota.patch
>
>
> The HDFS root directory starts out with a default namespace quota of 
> Integer.MAX_VALUE. If you clear this quota (using "hadoop dfsadmin -clrQuota 
> /"), the fsimage gets corrupted immediately. Subsequent 2NN rolls will fail, 
> and the NN will not come back up from a restart.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-30 Thread Hairong Kuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hairong Kuang updated HDFS-1057:


Fix Version/s: 0.21.0
   0.22.0

I've committed the trunk change to 0.21 and trunk. Thanks Sam!

> Concurrent readers hit ChecksumExceptions if following a writer to very end 
> of file
> ---
>
> Key: HDFS-1057
> URL: https://issues.apache.org/jira/browse/HDFS-1057
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node
>Affects Versions: 0.20-append, 0.21.0, 0.22.0
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Blocker
> Fix For: 0.20-append, 0.21.0, 0.22.0
>
> Attachments: conurrent-reader-patch-1.txt, 
> conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
> HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, 
> hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, 
> hdfs-1057-trunk-6.txt
>
>
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
> calling flush(). Therefore, if there is a concurrent reader, it's possible to 
> race here - the reader will see the new length while those bytes are still in 
> the buffers of BlockReceiver. Thus the client will potentially see checksum 
> errors or EOFs. Additionally, the last checksum chunk of the file is made 
> accessible to readers even though it is not stable.
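
To make the race concrete, here is a minimal, self-contained illustration of the 
ordering principle at stake. It is plain java.io code with made-up names, not the 
actual BlockReceiver/ReplicaInPipeline classes.

{code}
// Only advertise a new visible length after the bytes backing it have been
// flushed; otherwise a concurrent reader can see a length that the underlying
// stream cannot yet serve (the race described above).
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

class VisibleLengthWriter {
  private final BufferedOutputStream out;
  private final AtomicLong bytesOnDisk = new AtomicLong(0); // length readers may trust

  VisibleLengthWriter(String path) throws IOException {
    this.out = new BufferedOutputStream(new FileOutputStream(path));
  }

  void receivePacket(byte[] data) throws IOException {
    out.write(data);
    // Buggy order (what the report describes): publish the length, then flush.
    //   bytesOnDisk.addAndGet(data.length); out.flush();
    out.flush();                          // make the bytes readable first
    bytesOnDisk.addAndGet(data.length);   // then expose the new length
  }

  long getVisibleLength() {
    return bytesOnDisk.get();
  }
}
{code}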

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN

2010-06-30 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated HDFS-1262:
---

Attachment: hdfs-1262-2.txt

Removed the HDFS-894 change from the patch (commit this to 0.20-append separately).

> Failed pipeline creation during append leaves lease hanging on NN
> -
>
> Key: HDFS-1262
> URL: https://issues.apache.org/jira/browse/HDFS-1262
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Critical
> Fix For: 0.20-append
>
> Attachments: hdfs-1262-1.txt, hdfs-1262-2.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the DFS client can think it failed to open a file for append when the NN 
> thinks the writer holds a lease. Since the writer keeps renewing its lease, 
> recovery never happens, and no one can open or recover the file until the DFS 
> client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN

2010-06-30 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883855#action_12883855
 ] 

sam rash commented on HDFS-1262:


That's probably better. This was dependent on it since I was killing the 
datanodes to simulate the pipeline failure. I ended up tuning the test case to 
use Mockito to throw exceptions at the end of an NN RPC call for both append() 
and create(), so I think that dependency is gone.

Can we mark this as dependent on that issue if it turns out to be needed?
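
Roughly, the Mockito trick described above looks like the sketch below. It is not 
the committed test; the two-argument ClientProtocol.append(src, clientName) 
signature is an assumption about the 0.20-append API, and the class/method names 
are made up. The same stubbing can be applied to create().

{code}
// Hedged sketch: inject a failure after an NN RPC call has completed on the NN side.
import static org.mockito.Matchers.anyString;
import static org.mockito.Mockito.doAnswer;
import static org.mockito.Mockito.spy;

import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

class FailAfterNNCall {
  static ClientProtocol wrap(ClientProtocol realNamenode) throws IOException {
    ClientProtocol spyNN = spy(realNamenode);
    // Let the real append() run (so the NN grants the lease), then throw, so the
    // client behaves as if pipeline setup failed right after the NN call.
    doAnswer(new Answer<Object>() {
      @Override
      public Object answer(InvocationOnMock invocation) throws Throwable {
        invocation.callRealMethod();
        throw new IOException("injected failure after NN append()");
      }
    }).when(spyNN).append(anyString(), anyString());
    return spyNN;
  }
}
{code}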


> Failed pipeline creation during append leaves lease hanging on NN
> -
>
> Key: HDFS-1262
> URL: https://issues.apache.org/jira/browse/HDFS-1262
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Critical
> Fix For: 0.20-append
>
> Attachments: hdfs-1262-1.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the DFS client can think it failed to open a file for append when the NN 
> thinks the writer holds a lease. Since the writer keeps renewing its lease, 
> recovery never happens, and no one can open or recover the file until the DFS 
> client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN

2010-06-30 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883857#action_12883857
 ] 

sam rash commented on HDFS-1262:


Verified the test case passes without that patch. We should commit HDFS-894 to 
0.20-append for sure, though; that seems like a potentially gnarly bug in tests 
to track down (took me a short spell).

I can upload the patch without the DatanodeID fix.

> Failed pipeline creation during append leaves lease hanging on NN
> -
>
> Key: HDFS-1262
> URL: https://issues.apache.org/jira/browse/HDFS-1262
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Critical
> Fix For: 0.20-append
>
> Attachments: hdfs-1262-1.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the DFS client can think it failed to open a file for append when the NN 
> thinks the writer holds a lease. Since the writer keeps renewing its lease, 
> recovery never happens, and no one can open or recover the file until the DFS 
> client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN

2010-06-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883849#action_12883849
 ] 

Todd Lipcon commented on HDFS-1262:
---

The bug mentioned above is HDFS-894; we just didn't commit it to the 0.20 branch. 
Maybe it's best to commit that one under the aegis of that issue rather than here?

> Failed pipeline creation during append leaves lease hanging on NN
> -
>
> Key: HDFS-1262
> URL: https://issues.apache.org/jira/browse/HDFS-1262
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Critical
> Fix For: 0.20-append
>
> Attachments: hdfs-1262-1.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the DFS client can think it failed to open a file for append when the NN 
> thinks the writer holds a lease. Since the writer keeps renewing its lease, 
> recovery never happens, and no one can open or recover the file until the DFS 
> client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1268) Extract blockInvalidateLimit as a seperated configuration

2010-06-30 Thread jinglong.liujl (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883848#action_12883848
 ] 

jinglong.liujl commented on HDFS-1268:
--

Do you mean "BLOCK_INVALIDATE_CHUNK"?
Currently, invalidBlocklimit is computed as max(BLOCK_INVALIDATE_CHUNK, 20 * 
heartbeatInterval). If I want to modify this parameter, there are two choices:
1. set BLOCK_INVALIDATE_CHUNK to a huge value and re-compile the code
2. increase the heartbeat interval, but then a heartbeat still cannot carry more blocks in total.
What I want is for invalidBlocklimit to be configurable by the user.

If our cluster hits this "corner case" (in fact, it's not a corner case; we use 
HBase on this cluster, so this case shows up very often), why not configure this 
parameter and restart the cluster to prevent the issue?
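
As a hedged sketch of what "configurable by the user" could look like: the key 
name dfs.block.invalidate.limit, the class name, and the default of 100 taken 
from the description below are illustrative assumptions, not the attached patch.

{code}
// Let an explicit config value override the computed default.
import org.apache.hadoop.conf.Configuration;

class BlockInvalidateLimitSketch {
  private static final int BLOCK_INVALIDATE_CHUNK = 100; // current hard-coded chunk size

  static int blockInvalidateLimit(Configuration conf, long heartbeatIntervalMs) {
    // Today's behaviour: max(BLOCK_INVALIDATE_CHUNK, 20 * heartbeat-interval-in-seconds).
    int computed = Math.max(BLOCK_INVALIDATE_CHUNK,
        20 * (int) (heartbeatIntervalMs / 1000));
    // Proposed behaviour: a user-supplied value wins; the computed value is the fallback.
    return conf.getInt("dfs.block.invalidate.limit", computed);
  }
}
{code}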


> Extract blockInvalidateLimit as a seperated configuration
> -
>
> Key: HDFS-1268
> URL: https://issues.apache.org/jira/browse/HDFS-1268
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.22.0
>Reporter: jinglong.liujl
> Attachments: patch.diff
>
>
> If there are many files piled up in recentInvalidateSets, only 
> Math.max(blockInvalidateLimit, 20*(int)(heartbeatInterval/1000)) invalid 
> blocks can be carried in a heartbeat (by default, it's 100). Under high write 
> stress, the removal of invalidated blocks cannot catch up with the speed of 
> writing. We extract blockInvalidateLimit into a separate config parameter so 
> that users can pick the right setting for their cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN

2010-06-30 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883830#action_12883830
 ] 

sam rash commented on HDFS-1262:


The above is from DatanodeID.java.

> Failed pipeline creation during append leaves lease hanging on NN
> -
>
> Key: HDFS-1262
> URL: https://issues.apache.org/jira/browse/HDFS-1262
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Critical
> Fix For: 0.20-append
>
> Attachments: hdfs-1262-1.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the DFS client can think it failed to open a file for append when the NN 
> thinks the writer holds a lease. Since the writer keeps renewing its lease, 
> recovery never happens, and no one can open or recover the file until the DFS 
> client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN

2010-06-30 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883828#action_12883828
 ] 

sam rash commented on HDFS-1262:


one note:

{code}
public void updateRegInfo(DatanodeID nodeReg) {
  name = nodeReg.getName();
  infoPort = nodeReg.getInfoPort();
  // update any more fields added in future.
}
{code}

should be:
{code}
public void updateRegInfo(DatanodeID nodeReg) {
  name = nodeReg.getName();
  infoPort = nodeReg.getInfoPort();
  ipcPort = nodeReg.getIpcPort();
  // update any more fields added in future.
}
{code}

It wasn't copying the ipcPort for some reason; my patch includes this fix.

Trunk doesn't have this bug.

> Failed pipeline creation during append leaves lease hanging on NN
> -
>
> Key: HDFS-1262
> URL: https://issues.apache.org/jira/browse/HDFS-1262
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: sam rash
>Priority: Critical
> Fix For: 0.20-append
>
> Attachments: hdfs-1262-1.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the DFS client can think it failed to open a file for append when the NN 
> thinks the writer holds a lease. Since the writer keeps renewing its lease, 
> recovery never happens, and no one can open or recover the file until the DFS 
> client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.