[jira] Commented: (HDFS-1263) 0.20: in tryUpdateBlock, the meta file is renamed away before genstamp validation is done
[ https://issues.apache.org/jira/browse/HDFS-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882088#action_12882088 ] dhruba borthakur commented on HDFS-1263: after reading through the description, it appears to me that this bug is similar to the one described in HDFS-1260. Shall we close this one as a duplicate? 0.20: in tryUpdateBlock, the meta file is renamed away before genstamp validation is done - Key: HDFS-1263 URL: https://issues.apache.org/jira/browse/HDFS-1263 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append Saw an issue where multiple datanodes were trying to recover at the same time, and all of them failed. I think the issue is that in FSDataset.tryUpdateBlock, we do the rename of blk_B_OldGS to blk_B_OldGS_tmpNewGS and *then* check that the generation stamp is moving upwards. Because of this, invalid updateBlock calls are rejected, but since the meta file has already been renamed away, they cause future updateBlock calls to fail with "Meta file not found" errors. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
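The reordering implied by the report, validating the genstamp first and only then renaming the meta file, might look roughly like the sketch below. Method and helper names approximate 0.20's FSDataset and are illustrative, not the committed fix.

{code}
// Sketch only: names approximate 0.20's FSDataset, not the committed fix.
private void tryUpdateBlock(Block oldblock, Block newblock) throws IOException {
  // Validate the generation stamp *before* touching the meta file.
  if (newblock.getGenerationStamp() <= oldblock.getGenerationStamp()) {
    // Rejecting here leaves blk_B_OldGS.meta in place, so later
    // updateBlock calls can still find it.
    throw new IOException("Cannot update block to older generation stamp: "
        + newblock.getGenerationStamp() + " <= " + oldblock.getGenerationStamp());
  }
  File oldMeta = findMetaFile(oldblock);  // illustrative helper
  File tmpMeta = new File(oldMeta.getParent(),
      oldMeta.getName() + "_tmp" + newblock.getGenerationStamp());
  // Only now is it safe to rename the meta file away.
  if (!oldMeta.renameTo(tmpMeta)) {
    throw new IOException("Cannot rename " + oldMeta + " to " + tmpMeta);
  }
  // ... write the new checksum data and rename to the final meta name ...
}
{code}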
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882092#action_12882092 ] dhruba borthakur commented on HDFS-1262: I am still thinking about this one and I agree that we should try to solve the general case too, but how about a workaround for HBase: 4) change org.apache.hadoop.hbase.util.FSUtils.recoverFileLease() to create a new FileSystem object via FileSystem.newInstance() every time in the while loop. Use this newly created FileSystem object to issue the appendFile() call. The downside is that this API is available only in 0.21 and above (but we could backport it). Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
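A minimal sketch of the suggested HBase-side workaround, assuming the 0.21+ FileSystem.newInstance() API mentioned above; the method shape loosely follows FSUtils.recoverFileLease, and the retry/sleep details are illustrative.

{code}
// Hypothetical sketch of the workaround: a fresh FileSystem (and therefore a
// fresh DFSClient/lease holder) per attempt, so a lease stuck on a previous
// instance cannot block recovery.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LeaseRecoveryUtil {
  public static void recoverFileLease(Path p, Configuration conf)
      throws IOException, InterruptedException {
    boolean recovered = false;
    while (!recovered) {
      // newInstance() bypasses the FileSystem cache, unlike FileSystem.get().
      FileSystem fs = FileSystem.newInstance(p.toUri(), conf);
      try {
        fs.append(p).close();  // append() triggers lease recovery on the NN
        recovered = true;
      } catch (IOException e) {
        Thread.sleep(1000);    // soft lease may not have expired yet; retry
      } finally {
        fs.close();            // drop this instance's lease renewal
      }
    }
  }
}
{code}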
[jira] Commented: (HDFS-1042) Namesystem metrics are collected before namesystem is fully initialized
[ https://issues.apache.org/jira/browse/HDFS-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882133#action_12882133 ] Gokul commented on HDFS-1042: - so can we call the registerMBean() method just after the FSDirectory initialization in the FSNamesystem.initialize() method? Namesystem metrics are collected before namesystem is fully initialized --- Key: HDFS-1042 URL: https://issues.apache.org/jira/browse/HDFS-1042 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.2, 0.21.0, 0.22.0 Reporter: Todd Lipcon FSNamesystem leaks its reference out to the metrics system before it has initialized FSDirectory. Therefore, on rare occasions you can get NPEs when metrics try to collect data from FSN before it's initialized. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
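A sketch of the ordering Gokul is suggesting, with illustrative method names rather than the literal 0.20 FSNamesystem code:

{code}
// Illustrative ordering only, not the literal FSNamesystem code.
private void initialize(NameNode nn, Configuration conf) throws IOException {
  // ... load the image/edits and construct internal state first ...
  this.dir = new FSDirectory(this, conf);
  // Leak 'this' to the metrics system only after FSDirectory exists,
  // so a concurrent metrics collection cannot hit an NPE.
  registerMBean(conf);
}
{code}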
[jira] Resolved: (HDFS-1263) 0.20: in tryUpdateBlock, the meta file is renamed away before genstamp validation is done
[ https://issues.apache.org/jira/browse/HDFS-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon resolved HDFS-1263. --- Resolution: Duplicate Yep. Initially it wasn't entirely clear they were the same, but I think you're right. And phew, very glad it was a simple fix :) 0.20: in tryUpdateBlock, the meta file is renamed away before genstamp validation is done - Key: HDFS-1263 URL: https://issues.apache.org/jira/browse/HDFS-1263 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.20-append Saw an issue where multiple datanodes were trying to recover at the same time, and all of them failed. I think the issue is that in FSDataset.tryUpdateBlock, we do the rename of blk_B_OldGS to blk_B_OldGS_tmpNewGS and *then* check that the generation stamp is moving upwards. Because of this, invalid updateBlock calls are rejected, but since the meta file has already been renamed away, they cause future updateBlock calls to fail with "Meta file not found" errors. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882160#action_12882160 ] Todd Lipcon commented on HDFS-1262: --- Oof, what a hack that would be :) Not to say we shouldn't do that over in HBase in the short term, but I agree let's treat this as a serious bug in 0.20-append and try to fix it on the HDFS side unless we really can't think of any implementable solutions. Can you think of a problem with the abandonBlock() solution? My thinking is that we'd check if the block is the last block of a file under construction by the abandoning client, and if so, reassign lease to NN_Recovery and initiate block synchronization from the NN as if the lease were lost. It may not be necessary to go through the whole recovery process, but it will be safer in case the client half set up a pipeline before failing, or somesuch. Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882175#action_12882175 ] sam rash commented on HDFS-1262: we actually use a new FileSystem instance per file in scribe. see http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html there are some downsides to this (creating a new FileSystem instance can be expensive, issuing fork/exec calls for 'whoami' and 'groups'), but we have patches to minimize this Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882176#action_12882176 ] sam rash commented on HDFS-1262: i am also wondering why this hasn't shown up in regular create calls at some point. both DFSClient.append() and DFSClient.create() are susceptible to the same problem (client has lease, then throws exception setting up pipeline) Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1257) Race condition introduced by HADOOP-5124
[ https://issues.apache.org/jira/browse/HDFS-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882228#action_12882228 ] Ramkumar Vadali commented on HDFS-1257: --- I will try to reproduce this with a unit-test, and will update with the results. Race condition introduced by HADOOP-5124 Key: HDFS-1257 URL: https://issues.apache.org/jira/browse/HDFS-1257 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Ramkumar Vadali HADOOP-5124 provided some improvements to FSNamesystem#recentInvalidateSets. But it introduced unprotected access to the data structure recentInvalidateSets. Specifically, FSNamesystem.computeInvalidateWork accesses recentInvalidateSets without read-lock protection. If there is concurrent activity (like reducing replication on a file) that adds to recentInvalidateSets, the name-node crashes with a ConcurrentModificationException. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
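For reference, the kind of guard the description implies might look like the sketch below; it assumes the 0.20-era global FSNamesystem lock, and invalidateWorkForOneNode() is an illustrative helper, not an actual method.

{code}
// Sketch only: take the same lock writers use before iterating
// recentInvalidateSets, so a concurrent add (e.g. from a replication
// change) cannot cause a ConcurrentModificationException mid-iteration.
int computeInvalidateWork(int nodesToProcess) {
  int blockCnt = 0;
  synchronized (this) {  // 0.20-era FSNamesystem uses one global lock
    for (Map.Entry<String, Collection<Block>> entry :
        recentInvalidateSets.entrySet()) {
      if (nodesToProcess-- <= 0) {
        break;
      }
      blockCnt += invalidateWorkForOneNode(entry.getKey());  // illustrative helper
    }
  }
  return blockCnt;
}
{code}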
[jira] Commented: (HDFS-1266) Missing license headers in branch-20-append
[ https://issues.apache.org/jira/browse/HDFS-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882231#action_12882231 ] Doug Cutting commented on HDFS-1266: Perhaps rat should be run as part of the normal patch testing? Missing license headers in branch-20-append --- Key: HDFS-1266 URL: https://issues.apache.org/jira/browse/HDFS-1266 Project: Hadoop HDFS Issue Type: Task Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Trivial Fix For: 0.20-append We appear to have some files without license headers; we should do a quick pass through and fix them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1256) libhdfs is missing from the tarball
[ https://issues.apache.org/jira/browse/HDFS-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom White updated HDFS-1256: Attachment: HDFS-1256.patch Slight modification to create the c++ build directory, which is needed if you are not building native code. Results of test-patch:
{noformat}
     [exec] -1 overall.
     [exec]
     [exec]     +1 @author. The patch does not contain any @author tags.
     [exec]
     [exec]     -1 tests included. The patch doesn't appear to include any new or modified tests.
     [exec]                        Please justify why no new tests are needed for this patch.
     [exec]                        Also please list what manual steps were performed to verify this patch.
     [exec]
     [exec]     +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs. The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit. The applied patch does not increase the total number of release audit warnings.
{noformat}
There are no tests since this is a packaging change. libhdfs is missing from the tarball --- Key: HDFS-1256 URL: https://issues.apache.org/jira/browse/HDFS-1256 Project: Hadoop HDFS Issue Type: Bug Reporter: Tom White Assignee: Tom White Priority: Blocker Fix For: 0.21.0 Attachments: HDFS-1256.patch, HDFS-1256.patch It is being compiled, but is not added to the distribution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882253#action_12882253 ] Suresh Srinivas commented on HDFS-1114: --- Minor comment: TestGSet in Y20 version can be a junit4 test. You do not need to extend TestCase. +1 for the patch. Reducing NameNode memory usage by an alternate hash table - Key: HDFS-1114 URL: https://issues.apache.org/jira/browse/HDFS-1114 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Fix For: 0.22.0 Attachments: benchmark20100618.patch, GSet20100525.pdf, gset20100608.pdf, h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, h1114_20100616b.patch, h1114_20100617.patch, h1114_20100617b.patch, h1114_20100617b_y0.20.1xx.patch NameNode uses a java.util.HashMap to store BlockInfo objects. When there are many blocks in HDFS, this map uses a lot of memory in the NameNode. We may optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1257) Race condition introduced by HADOOP-5124
[ https://issues.apache.org/jira/browse/HDFS-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882254#action_12882254 ] Scott Carey commented on HDFS-1257: --- Is ConcurrentHashMap an option here? It is significantly more efficient than Collections.synchronizedMap(). I believe ConcurrentModificationException can happen if a put() on a key happens at the same time as a get() with the normal HashMap. Furthermore, if a put() or other operation causes the map to re-size, and there is a get() concurrently in progress, it can lead to the get ending up in an infinite loop. I've seen that one many times. ConcurrentHashMap doesn't have that issue. Race condition introduced by HADOOP-5124 Key: HDFS-1257 URL: https://issues.apache.org/jira/browse/HDFS-1257 Project: Hadoop HDFS Issue Type: Bug Components: name-node Reporter: Ramkumar Vadali HADOOP-5124 provided some improvements to FSNamesystem#recentInvalidateSets. But it introduced unprotected access to the data structure recentInvalidateSets. Specifically, FSNamesystem.computeInvalidateWork accesses recentInvalidateSets without read-lock protection. If there is concurrent activity (like reducing replication on a file) that adds to recentInvalidateSets, the name-node crashes with a ConcurrentModificationException. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
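A small self-contained illustration of the difference Scott describes (not taken from any patch): a ConcurrentHashMap can be iterated while another thread mutates it, where a plain HashMap could throw ConcurrentModificationException or loop on a concurrent resize.

{code}
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class InvalidateSetsDemo {
  public static void main(String[] args) {
    // ConcurrentHashMap supports concurrent get/put, and its iterators are
    // weakly consistent: they never throw ConcurrentModificationException.
    Map<String, List<String>> invalidateSets =
        new ConcurrentHashMap<String, List<String>>();
    invalidateSets.put("datanode-1",
        new ArrayList<String>(Arrays.asList("blk_1", "blk_2")));
    invalidateSets.put("datanode-2",
        new ArrayList<String>(Arrays.asList("blk_3")));
    // Safe even if another thread put()s "datanode-3" during this loop;
    // with a plain HashMap this iteration could fail.
    for (Map.Entry<String, List<String>> e : invalidateSets.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
{code}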
[jira] Commented: (HDFS-1250) Namenode accepts block report from dead datanodes
[ https://issues.apache.org/jira/browse/HDFS-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882258#action_12882258 ] Suresh Srinivas commented on HDFS-1250: --- Ignore the previous comment. Posted it to the wrong jira. Namenode accepts block report from dead datanodes - Key: HDFS-1250 URL: https://issues.apache.org/jira/browse/HDFS-1250 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.2, 0.22.0 Reporter: Suresh Srinivas Assignee: Suresh Srinivas When a datanode heartbeat times out, the namenode marks it dead. The subsequent heartbeat from the datanode is rejected with a command to the datanode to re-register. However, the namenode accepts block reports from the datanode although it is marked dead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-752) Add interface classification stable scope to HDFS
[ https://issues.apache.org/jira/browse/HDFS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882259#action_12882259 ] Suresh Srinivas commented on HDFS-752: -- Realized I had posted this comment to another jira. Here is the test I ran before committing the patch to trunk: Hudson is stuck. I ran testpatch; here is the result:
     [exec] -1 overall.
     [exec]
     [exec]     +1 @author. The patch does not contain any @author tags.
     [exec]
     [exec]     -1 tests included. The patch doesn't appear to include any new or modified tests.
     [exec]                        Please justify why no new tests are needed for this patch.
     [exec]                        Also please list what manual steps were performed to verify this patch.
     [exec]
     [exec]     +1 javadoc. The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac. The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs. The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit. The applied patch does not increase the total number of release audit warnings.
This change just tags the code with interface classification. Hence tests are not included. I also ran unit tests and they ran without failures. Add interface classification stable scope to HDFS --- Key: HDFS-752 URL: https://issues.apache.org/jira/browse/HDFS-752 Project: Hadoop HDFS Issue Type: New Feature Affects Versions: 0.21.0, 0.22.0 Reporter: Suresh Srinivas Assignee: Suresh Srinivas Fix For: 0.22.0 Attachments: HDFS-752.1.patch, HDFS-752.patch, HDFS-752.rel21.patch, hdfs.interface.txt This jira addresses adding interface classification for the classes in hadoop hdfs, based on the mechanism described in Hadoop-5073. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882263#action_12882263 ] sam rash commented on HDFS-1262: todd: can you confirm if the exception was from the namenode.append() call or creating the output stream? (sounds like the latter, in the lease recovery it initiates)
{code}
OutputStream append(String src, int buffersize, Progressable progress
    ) throws IOException {
  checkOpen();
  FileStatus stat = null;
  LocatedBlock lastBlock = null;
  try {
    stat = getFileInfo(src);
    lastBlock = namenode.append(src, clientName);
  } catch(RemoteException re) {
    throw re.unwrapRemoteException(FileNotFoundException.class,
                                   AccessControlException.class,
                                   NSQuotaExceededException.class,
                                   DSQuotaExceededException.class);
  }
  OutputStream result = new DFSOutputStream(src, buffersize, progress,
      lastBlock, stat, conf.getInt("io.bytes.per.checksum", 512));
  leasechecker.put(src, result);
  return result;
}
{code}
either way, i think the right way to do this is to add back an abandonFile RPC call in the NN. Even if we don't change function call signatures for abandonBlock, we will break client/server compatibility. thoughts? Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
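For what it's worth, the proposed RPC might look something like the sketch below; this is entirely hypothetical, since no signature was agreed in the thread.

{code}
// Hypothetical addition to ClientProtocol; no signature was agreed in this
// thread. The idea: let a client that failed while setting up a pipeline
// explicitly give up the file so the NN releases the lease.
public interface ClientProtocol extends VersionedProtocol {
  /**
   * Abandon an open file: release the holder's lease on src and trigger
   * block synchronization on its last block, as if the lease had expired.
   */
  void abandonFile(String src, String holder) throws IOException;
}
{code}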
[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo (Nicholas), SZE updated HDFS-1114: - Attachment: h1114_20100617b2_y0.20.1xx.patch Thanks Suresh. h1114_20100617b2_y0.20.1xx.patch: used the original junit 4 TestGSet Reducing NameNode memory usage by an alternate hash table - Key: HDFS-1114 URL: https://issues.apache.org/jira/browse/HDFS-1114 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Fix For: 0.22.0 Attachments: benchmark20100618.patch, GSet20100525.pdf, gset20100608.pdf, h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, h1114_20100616b.patch, h1114_20100617.patch, h1114_20100617b.patch, h1114_20100617b2_y0.20.1xx.patch, h1114_20100617b_y0.20.1xx.patch NameNode uses a java.util.HashMap to store BlockInfo objects. When there are many blocks in HDFS, this map uses a lot of memory in the NameNode. We may optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1256) libhdfs is missing from the tarball
[ https://issues.apache.org/jira/browse/HDFS-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom White updated HDFS-1256: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed I've just committed this. libhdfs is missing from the tarball --- Key: HDFS-1256 URL: https://issues.apache.org/jira/browse/HDFS-1256 Project: Hadoop HDFS Issue Type: Bug Reporter: Tom White Assignee: Tom White Priority: Blocker Fix For: 0.21.0 Attachments: HDFS-1256.patch, HDFS-1256.patch It is being compiled, but is not added to the distribution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882273#action_12882273 ] Todd Lipcon commented on HDFS-1262: --- Yea, the exception in this case was during the new DFSOutputStream - specifically the block recovery part as you mentioned. I think the actual cause of that failure was HDFS-1260, but we should still address this issue too. You're right, we should add a new RPC rather than overloading abandonBlock. abandonFile makes sense, or abandonAppendBlock, or something like that. Do you want to take a crack at implementing it or should I? Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882283#action_12882283 ] sam rash commented on HDFS-1262: i'd appreciate the chance to implement it, actually. Thanks. Re: the name, according to Dhruba, there used to be one called abandonFile which had the semantics we need. Also, a similar error can occur on non-append creates, so probably having append in the name doesn't make sense. abandonFile or another idea? Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882290#action_12882290 ] Todd Lipcon commented on HDFS-1262: --- Ah, the abandonFile RPC must have been before my time. Looks like it got deleted in 0.18. Go for it! :) Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table
[ https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882307#action_12882307 ] Suresh Srinivas commented on HDFS-1114: --- +1 for the new patch. Reducing NameNode memory usage by an alternate hash table - Key: HDFS-1114 URL: https://issues.apache.org/jira/browse/HDFS-1114 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: Tsz Wo (Nicholas), SZE Assignee: Tsz Wo (Nicholas), SZE Fix For: 0.22.0 Attachments: benchmark20100618.patch, GSet20100525.pdf, gset20100608.pdf, h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, h1114_20100616b.patch, h1114_20100617.patch, h1114_20100617b.patch, h1114_20100617b2_y0.20.1xx.patch, h1114_20100617b_y0.20.1xx.patch NameNode uses a java.util.HashMap to store BlockInfo objects. When there are many blocks in HDFS, this map uses a lot of memory in the NameNode. We may optimize the memory usage by a light weight hash table implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1056) Multi-node RPC deadlocks during block recovery
[ https://issues.apache.org/jira/browse/HDFS-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Spiegelberg updated HDFS-1056: -- Attachment: 0013-HDFS-1056.-Fix-possible-multinode-deadlocks-during-b.patch added Todd's fix for 0.20-append. no unit test yet Multi-node RPC deadlocks during block recovery -- Key: HDFS-1056 URL: https://issues.apache.org/jira/browse/HDFS-1056 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Affects Versions: 0.20.2, 0.21.0, 0.22.0 Reporter: Todd Lipcon Fix For: 0.20-append Attachments: 0013-HDFS-1056.-Fix-possible-multinode-deadlocks-during-b.patch Believe it or not, I'm seeing HADOOP-3657 / HADOOP-3673 in a 5-node 0.20 cluster. I have many concurrent writes on the cluster, and when I kill a DN, some percentage of the time I get one of these cross-node deadlocks among 3 of the nodes (replication 3). All of the DN RPC server threads are tied up waiting on RPC clients to other datanodes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Spiegelberg updated HDFS-1057: -- Attachment: HDFS-1057-0.20-append.patch fix of HDFS-1057 for 0.20-append branch (courtesy of Todd) Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Fix For: 0.20-append Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash reassigned HDFS-1262: -- Assignee: sam rash Failed pipeline creation during append leaves lease hanging on NN - Key: HDFS-1262 URL: https://issues.apache.org/jira/browse/HDFS-1262 Project: Hadoop HDFS Issue Type: Bug Components: hdfs client, name-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: sam rash Priority: Critical Fix For: 0.20-append Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following: 1) File's original writer died 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery 3) Recovery completed successfully 4) Recovery client calls append again, which succeeds on the NN 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase master. HBase assumed the file wasn't open and put it back on a queue to try later 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover. The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that a client can think it failed to open a file for append while the NN still thinks that client holds the lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882350#action_12882350 ] sam rash commented on HDFS-1186: hey todd, i was looking at this patch, and while it has certainly reduced the chance of problems, isn't it still possible a new writer thread could be created:
1. between the kill loop in startBlockRecovery() and the synchronized block
2. between the startBlockRecovery() call and updateBlock() call
I seem to recall reasoning with dhruba that while in theory these could occur from the DN perspective, the circumstances that would have to occur outside were not possible (once you fixed hdfs-1260 anyway, where genstamp checks work right in concurrent lease recovery). what's your take on this? is it fool-proof now? (1 & 2 can't happen) or what about introducing a state like RUR here? (at least disabling writes to a block while under recovery, maybe timing out in case the lease recovery owner dies) 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
[ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882359#action_12882359 ] sam rash commented on HDFS-1057: patch should use --no-prefix to get rid of 'a' and 'b' in paths, fyi Concurrent readers hit ChecksumExceptions if following a writer to very end of file --- Key: HDFS-1057 URL: https://issues.apache.org/jira/browse/HDFS-1057 Project: Hadoop HDFS Issue Type: Sub-task Components: data-node Affects Versions: 0.20-append, 0.21.0, 0.22.0 Reporter: Todd Lipcon Assignee: sam rash Priority: Blocker Fix For: 0.20-append Attachments: conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882365#action_12882365 ] Todd Lipcon commented on HDFS-1186: --- Hi Sam, thanks for taking a look. I think you're right that in some really weird timing scenarios we might have a problem:
{noformat}
writer writes offset 1 and syncs, gs=1
NN recovery starts:
  - interrupts writer, gets metadata (len 1)
  - recovering DN hangs for a little bit
writer recovery starts, picks a different primary DN:
  - interrupts writer (noop)
  - gets metadata (len 1)
  - gets new GS=2
  - syncs blocks to GS=2 len=1
  - restarts pipeline
  - writes and syncs some more data to block with GS=2
NN-directed recovery proceeds:
  - gets new GS=3 (this has to be at least 10 seconds after above due to lastRecoveryTime check)
  - calls updateBlock on all DNs, which truncates files
{noformat}
I think the issue here is that the genstamp can be incremented in between startBlockRecovery() and updateBlock(), and thus updateBlock is allowing an update based on stale recovery info. If we simply added a check in tryUpdateBlock() that oldblock.getGenerationStamp() == oldgs, I think we'd be safe. What do you think? 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
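A sketch of the guard Todd proposes, with approximate names (parseGenerationStamp and findMetaFile stand in for whatever FSDataset actually uses); it shows the check itself, not the committed patch.

{code}
// Sketch of the proposed check, with approximate names.
private synchronized void tryUpdateBlock(Block oldblock, Block newblock)
    throws IOException {
  // Generation stamp currently recorded on disk (parsed from the meta
  // file name); parseGenerationStamp/findMetaFile are illustrative helpers.
  long ondiskGS = parseGenerationStamp(getBlockFile(oldblock),
                                       findMetaFile(oldblock));
  // Reject stale recovery info: the caller must still be looking at the
  // same generation it saw at startBlockRecovery() time. A concurrent
  // recovery that already bumped the genstamp makes this one lose.
  if (oldblock.getGenerationStamp() != ondiskGS) {
    throw new IOException("Generation stamp changed during recovery: expected "
        + oldblock.getGenerationStamp() + " but found " + ondiskGS);
  }
  // ... only now rename the meta file and apply the update ...
}
{code}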
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882367#action_12882367 ] sam rash commented on HDFS-1186: yea, i think so. let me repeat it slightly differently to make sure I get this at a higher level:
1. we make sure that a lease recovery that starts with an old gs at one stage (that's synchronized) actually mutates the block data of only the same gs
2. new writers that come in between the start of recovery and the actual stamping must have a new gs since they can only come into being via lease recovery
this is effectively saying that if concurrent lease recoveries get started, the first to complete wins (as it should), and later completions just fail. sounds like optimistic locking/versioned puts in the cache world actually: updateBlock requires the source to match an expected source. nice idea 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882376#action_12882376 ] Todd Lipcon commented on HDFS-1186: --- I'm wondering if there is still some sort of weird issue, though... Let's say there are two concurrent recoveries, one trying to update to genstamp 2 and the other trying to update to genstamp 3, and there are 3 DNs. Let's say that GS=2 recovery wins on DN A and B, and GS=3 recovery wins on DN C, a little bit later. The commitBlockSynchronization() call for GS=3 works, even though the client started writing again to GS=2. It's almost as if we need to track the lease holder name through the block synchronization, and only allow nextGenerationStamp and commitBlockSynchronization to succeed if the lease holder agrees? 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882386#action_12882386 ] sam rash commented on HDFS-1186: how could this happen? the GS=2 stamp succeeds on A and B. for GS=3 to win on C, GS=2 had to fail, which means it went 2nd. The primary for GS=2 would get a failure doing the stamping of DN C and would fail the lease recovery, right? 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882388#action_12882388 ] Todd Lipcon commented on HDFS-1186: --- The lease recovery only fails if *all* replicas fail to updateBlock. As long as at least one updates, it succeeds and calls commitBlockSynchronization 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882389#action_12882389 ] sam rash commented on HDFS-1186: I think you can make this argument:
1. each node has to make a transition from x -> x+k
2. at most one node owns any x -> x+k transition as the primary of a recovery
3. success requires all DNs to complete x -> x+k
4. primary then commits x -> x+k and until commitBlockSync completes, no transition y -> y+j with y > x can come in
right? 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1250) Namenode accepts block report from dead datanodes
[ https://issues.apache.org/jira/browse/HDFS-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Srinivas updated HDFS-1250: -- Attachment: HDFS-1250.patch The attached patch makes the following changes:
# Currently when an unknown datanode sends blockReport(), namenode rejects it with an IOException. The same is done when a dead datanode sends a block report.
# Currently when an unknown datanode sends a blockReceived() request, namenode rejects it with IllegalArgumentException. I am changing this to IOException, to be consistent with blockReport(). The same IOException is thrown when a dead datanode sends blockReceived().
# I have added a new test to ensure the following requests are rejected from dead datanodes:
#* blockReceived()
#* blockReport()
#* sendHeartBeat()
Namenode accepts block report from dead datanodes - Key: HDFS-1250 URL: https://issues.apache.org/jira/browse/HDFS-1250 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 0.20.2, 0.22.0 Reporter: Suresh Srinivas Assignee: Suresh Srinivas Attachments: HDFS-1250.patch When a datanode heartbeat times out, the namenode marks it dead. The subsequent heartbeat from the datanode is rejected with a command to the datanode to re-register. However, the namenode accepts block reports from the datanode although it is marked dead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
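The dead-node guard described in the patch notes might look roughly like this sketch; field and method names approximate 0.20-era FSNamesystem and are not the literal patch.

{code}
// Sketch of the dead-node guard, with approximate 0.20-era names.
public synchronized void processBlockReport(DatanodeRegistration nodeReg,
                                            BlockListAsLongs report)
    throws IOException {
  DatanodeDescriptor node = getDatanode(nodeReg);  // null when unregistered
  if (node == null || !node.isAlive) {
    // Reject the report; the datanode must re-register (as it already must
    // after a rejected heartbeat) before its reports are accepted.
    throw new IOException("blockReport from unregistered or dead node "
        + nodeReg.getName());
  }
  // ... process the report normally ...
}
{code}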
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882401#action_12882401 ] sam rash commented on HDFS-1186: hmm i wonder why only 1? if the client thinks there are 3 DNs in the pipeline and asks to recover 3, i think it should fail with fewer than 3. a client can request fewer if that works (in which case we do have to handle the problem you lay out). so in your sol'n, you are saying that the lease holder, the client, needs to be contacted to verify the primary is the only one doing lease recovery? (or at least the latest) 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1186) 0.20: DNs should interrupt writers at start of recovery
[ https://issues.apache.org/jira/browse/HDFS-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882404#action_12882404 ] sam rash commented on HDFS-1186: wait, why can't commitBlockSync on the NN just do the same check on genstamps? if two primaries start concurrent lease recoveries and split the remaining nodes as far as who wins in stamping, the NN can resolve the issue of who wins in the end. then the loser will be marked as invalid and replication takes over to fix it. or do i have this sinking feeling because i am still missing something? 0.20: DNs should interrupt writers at start of recovery --- Key: HDFS-1186 URL: https://issues.apache.org/jira/browse/HDFS-1186 Project: Hadoop HDFS Issue Type: Bug Components: data-node Affects Versions: 0.20-append Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hdfs-1186.txt When block recovery starts (eg due to NN recovering lease) it needs to interrupt any writers currently writing to those blocks. Otherwise, an old writer (who hasn't realized he lost his lease) can continue to write+sync to the blocks, and thus recovery ends up truncating data that has been sync()ed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-708) A stress-test tool for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-708: - Attachment: SLiveTest.pdf Updating documentation, adding multiple reducers per MAPREDUCE-1893. A stress-test tool for HDFS. Key: HDFS-708 URL: https://issues.apache.org/jira/browse/HDFS-708 Project: Hadoop HDFS Issue Type: New Feature Components: test, tools Affects Versions: 0.22.0 Reporter: Konstantin Shvachko Assignee: Joshua Harlow Fix For: 0.22.0 Attachments: slive.patch, slive.patch.1, SLiveTest.pdf, SLiveTest.pdf, SLiveTest.pdf It would be good to have a tool for automatic stress testing HDFS, which would provide IO-intensive load on an HDFS cluster. The idea is to start the tool, let it run overnight, and then be able to analyze possible failures. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1256) libhdfs is missing from the tarball
[ https://issues.apache.org/jira/browse/HDFS-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882409#action_12882409 ] Hudson commented on HDFS-1256: -- Integrated in Hadoop-Hdfs-trunk-Commit #321 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/321/]) libhdfs is missing from the tarball --- Key: HDFS-1256 URL: https://issues.apache.org/jira/browse/HDFS-1256 Project: Hadoop HDFS Issue Type: Bug Reporter: Tom White Assignee: Tom White Priority: Blocker Fix For: 0.21.0 Attachments: HDFS-1256.patch, HDFS-1256.patch It is being compiled, but is not added to the distribution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1230) BlocksMap.blockinfo is not getting cleared immediately after deleting a block. This will be cleared only after block report comes from the datanode. Why we need to maintain
[ https://issues.apache.org/jira/browse/HDFS-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882425#action_12882425 ] Konstantin Shvachko commented on HDFS-1230: --- Thanks, Gokul, for clarifying. Makes sense now. This has already been changed in the trunk, so it would be only for 0.20. BlocksMap.blockinfo is not getting cleared immediately after deleting a block. This will be cleared only after a block report comes from the datanode. Why do we need to maintain the blockinfo till that time? Key: HDFS-1230 URL: https://issues.apache.org/jira/browse/HDFS-1230 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Affects Versions: 0.20.1 Reporter: Gokul BlocksMap.blockinfo is not getting cleared immediately after deleting a block. This will be cleared only after a block report comes from the datanode. Why do we need to maintain the blockinfo till that time? It increases namenode memory unnecessarily. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
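A minimal sketch of the improvement under discussion, assuming the map can be keyed by block ID; the types below are simplified stand-ins for BlocksMap and BlockInfo, not the real namenode classes:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: remove the BlocksMap entry while processing the
// file deletion, rather than waiting for the next block report.
class BlocksMapSketch {
  private final Map<Long, Object> blockToInfo = new HashMap<Long, Object>();

  void removeBlocksForDeletedFile(long[] blockIds) {
    for (long id : blockIds) {
      blockToInfo.remove(id); // frees the per-block info right away
    }
  }
}
{code}
Dropping the entry at delete time releases the blockinfo immediately instead of holding it in namenode memory until the datanode's next block report arrives.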
[jira] Commented: (HDFS-729) fsck option to list only corrupted files
[ https://issues.apache.org/jira/browse/HDFS-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882426#action_12882426 ] Konstantin Shvachko commented on HDFS-729: -- Could you please update the components and version fields for this jira? fsck option to list only corrupted files Key: HDFS-729 URL: https://issues.apache.org/jira/browse/HDFS-729 Project: Hadoop HDFS Issue Type: Improvement Reporter: dhruba borthakur Assignee: Rodrigo Schmidt Attachments: badFiles.txt, badFiles2.txt, corruptFiles.txt, HDFS-729.1.patch, HDFS-729.2.patch, HDFS-729.3.patch, HDFS-729.4.patch, HDFS-729.5.patch, HDFS-729.6.patch An option for fsck to list only corrupted files would be very helpful for frequent monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete
[ https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882427#action_12882427 ] Konstantin Shvachko commented on HDFS-1111: --- Rodrigo, do I understand correctly that in the case you describe the output will be an empty list, because everything will be filtered out (same as before), and A FEW will indicate that more corrupt files are somewhere in the system? It does not seem very informative or user friendly to me. What do you think? bq. I just created HDFS-1265 for that. I think it is better to come to some general idea of how things should work before creating jiras. Maybe you will decide to fix the problem here. BTW, do you plan to complete this new jira? getCorruptFiles() should give some hint that the list is not complete - Key: HDFS-1111 URL: https://issues.apache.org/jira/browse/HDFS-1111 Project: Hadoop HDFS Issue Type: New Feature Reporter: Rodrigo Schmidt Assignee: Rodrigo Schmidt Attachments: HADFS-1111.0.patch The list of corrupt files returned by the namenode doesn't say anything if the number of corrupted files is larger than the call output limit (which means the list is not complete). There should be a way to hint incompleteness to clients. A simple hack would be to add an extra entry to the returned array with the value null. Clients could interpret this as a sign that there are other corrupt files in the system. We should also do some rephrasing of the fsck output to make it more confident when the list is complete and less confident when the list is known to be incomplete. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
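For illustration, a minimal sketch of how a client could detect the null-sentinel hack described in the issue; isIncomplete is a hypothetical helper, not part of the actual API:
{code}
// Hypothetical sketch of the "null sentinel" convention: the server
// appends a null slot when the corrupt-file list was truncated.
class CorruptListClient {
  static boolean isIncomplete(String[] corruptFiles) {
    return corruptFiles != null
        && corruptFiles.length > 0
        && corruptFiles[corruptFiles.length - 1] == null;
  }
}
{code}
A client seeing isIncomplete() return true would know there are more corrupt files in the system than the call could return.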
[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete
[ https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882437#action_12882437 ] Rodrigo Schmidt commented on HDFS-1111: --- If the list is empty, we don't print the ALL/A FEW line:
{code}
if (matchedCorruptFilesCount == 1) {
  out.println("Here are " + filler + "corrupted files:");
  out.println("===");
}
{code}
If the list is incomplete (empty or not), we print out a header alerting the user:
{code}
if (!corruptFileStatuses.isComplete()) {
  filler = "A FEW ";
  out.println("\n\nATTENTION: List of corrupted files returned from "
      + "namenode was INCOMPLETE.\n\n");
}
{code}
I thought that would be enough, but I'm open to other options. What else would you like to add? As for creating JIRAs, I guess I was just trying to be a little proactive, but maybe it was wrong. My rationale was the following: 1) These are orthogonal problems (incomplete lists, and server-side filtering). 2) The current patch for this JIRA is already long and complicated. Extending it would increase the chances of introducing bugs. 3) Blocking one change should not necessarily block the other, thus calling for a separate JIRA. I assigned HDFS-1265 to me because I've been dealing with the getCorruptFiles() API since its creation. I assumed I would probably be the one working on it anyway, though I don't plan to do this in the next few days. If you think it should be left unassigned or deleted, I don't mind. getCorruptFiles() should give some hint that the list is not complete - Key: HDFS-1111 URL: https://issues.apache.org/jira/browse/HDFS-1111 Project: Hadoop HDFS Issue Type: New Feature Reporter: Rodrigo Schmidt Assignee: Rodrigo Schmidt Attachments: HADFS-1111.0.patch The list of corrupt files returned by the namenode doesn't say anything if the number of corrupted files is larger than the call output limit (which means the list is not complete). There should be a way to hint incompleteness to clients. A simple hack would be to add an extra entry to the returned array with the value null. Clients could interpret this as a sign that there are other corrupt files in the system. We should also do some rephrasing of the fsck output to make it more confident when the list is complete and less confident when the list is known to be incomplete. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
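For context, a minimal sketch of the kind of wrapper the corruptFileStatuses.isComplete() call above implies: the file list plus a flag recording whether the server truncated it. This is an illustrative stand-in, not the class from the patch:
{code}
// Hypothetical sketch: pair the corrupt-file list with a completeness
// flag so callers like fsck can adjust their output wording.
class CorruptFileStatuses {
  private final String[] files;
  private final boolean complete;

  CorruptFileStatuses(String[] files, boolean complete) {
    this.files = files;
    this.complete = complete;
  }

  boolean isComplete() { return complete; }
  String[] getFiles() { return files; }
}
{code}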
[jira] Updated: (HDFS-729) fsck option to list only corrupted files
[ https://issues.apache.org/jira/browse/HDFS-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rodrigo Schmidt updated HDFS-729: - Fix Version/s: 0.21.0 Component/s: name-node fsck option to list only corrupted files Key: HDFS-729 URL: https://issues.apache.org/jira/browse/HDFS-729 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Reporter: dhruba borthakur Assignee: Rodrigo Schmidt Fix For: 0.21.0 Attachments: badFiles.txt, badFiles2.txt, corruptFiles.txt, HDFS-729.1.patch, HDFS-729.2.patch, HDFS-729.3.patch, HDFS-729.4.patch, HDFS-729.5.patch, HDFS-729.6.patch An option for fsck to list only corrupted files would be very helpful for frequent monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.