[jira] Commented: (HDFS-2) IllegalMonitorStateException in DataNode shutdown

2010-06-16 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879315#action_12879315
 ] 

Amareshwari Sriramadasu commented on HDFS-2:


Another occurrence of a test failure with the same exception @ 
[failure|http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/573/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/].

 IllegalMonitorStateException in DataNode shutdown
 -

 Key: HDFS-2
 URL: https://issues.apache.org/jira/browse/HDFS-2
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Amareshwari Sriramadasu

 This was observed in one of the hudson runs; 
 org.apache.hadoop.mapred.TestBadRecords.testBadMapRed failed with the following 
 exception trace:
 {noformat}
 java.lang.IllegalMonitorStateException
   at java.lang.Object.notifyAll(Native Method)
   at org.apache.hadoop.ipc.Server.stop(Server.java:1110)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:576)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:569)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:553)
   at 
 org.apache.hadoop.mapred.ClusterMapReduceTestCase.stopCluster(ClusterMapReduceTestCase.java:129)
   at 
 org.apache.hadoop.mapred.ClusterMapReduceTestCase.tearDown(ClusterMapReduceTestCase.java:140)
 {noformat}
 More logs @
 http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/testReport/org.apache.hadoop.mapred/TestBadRecords/testBadMapRed/
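
(For reference, the exception in the trace above is what java.lang.Object.notifyAll() 
throws when the calling thread does not hold the object's monitor, which suggests 
Server.stop() reached a notifyAll() outside the corresponding synchronized block. A 
minimal standalone reproduction, plain Java and not HDFS code:)

{code}
public class NotifyWithoutLock {
  public static void main(String[] args) {
    Object lock = new Object();

    // Throws java.lang.IllegalMonitorStateException: notifyAll() requires
    // that the current thread owns the monitor of 'lock'.
    lock.notifyAll();

    // Correct form: acquire the monitor first.
    // synchronized (lock) { lock.notifyAll(); }
  }
}
{code}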

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1213) Implement a VFS Driver for HDFS

2010-06-16 Thread Michael D'Amour (JIRA)
Implement a VFS Driver for HDFS
---

 Key: HDFS-1213
 URL: https://issues.apache.org/jira/browse/HDFS-1213
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Michael D'Amour


We have an open source ETL tool (Kettle) which uses VFS for many input/output 
steps/jobs.  We would like to be able to read/write HDFS from Kettle using VFS. 
 
 
I haven't been able to find anything out there other than it would be nice.
 
I had some time a few weeks ago to begin writing a VFS driver for HDFS and we 
(Pentaho) would like to be able to contribute this driver.  I believe it 
supports all the major file/folder operations and I have written unit tests for 
all of these operations.  The code is currently checked into an open Pentaho 
SVN repository under the Apache 2.0 license.  There are some current 
limitations, such as a lack of authentication (kerberos), which appears to be 
coming in 0.22.0, however, the driver supports username/password, but I just 
can't use them yet.

I will be attaching the code for the driver once the case is created.  The 
project does not modify existing hadoop/hdfs source.

Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1213) Implement a VFS Driver for HDFS

2010-06-16 Thread Michael D'Amour (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael D'Amour updated HDFS-1213:
--

Attachment: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz
pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar

Here's the source for our VFS driver.  

Also, attached is a build which can be dropped into any project's classpath; if 
the project supports VFS, HDFS will be available simply by using HDFS URLs such 
as hdfs://hostname:port/path/to/your/file.
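
(For illustration, usage through Commons VFS would look roughly like the 
following once the attached jar is on the classpath; hostname, port and path 
are placeholders, and the exact scheme registration depends on how the driver 
configures its provider.)

{code}
import org.apache.commons.vfs.FileObject;
import org.apache.commons.vfs.FileSystemManager;
import org.apache.commons.vfs.VFS;

public class HdfsVfsExample {
  public static void main(String[] args) throws Exception {
    // The standard manager discovers VFS providers available on the classpath.
    FileSystemManager fsManager = VFS.getManager();

    // Resolve an HDFS URL exactly like any other VFS URL (placeholder values).
    FileObject file = fsManager.resolveFile("hdfs://namenode-host:9000/path/to/your/file");

    System.out.println(file.getName() + " exists: " + file.exists()
        + ", size: " + file.getContent().getSize());
  }
}
{code}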

 Implement a VFS Driver for HDFS
 ---

 Key: HDFS-1213
 URL: https://issues.apache.org/jira/browse/HDFS-1213
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Michael D'Amour
 Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, 
 pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar


 We have an open source ETL tool (Kettle) which uses VFS for many input/output 
 steps/jobs.  We would like to be able to read/write HDFS from Kettle using 
 VFS.  
  
 I haven't been able to find anything out there other than it would be nice.
  
 I had some time a few weeks ago to begin writing a VFS driver for HDFS and we 
 (Pentaho) would like to be able to contribute this driver.  I believe it 
 supports all the major file/folder operations and I have written unit tests 
 for all of these operations.  The code is currently checked into an open 
 Pentaho SVN repository under the Apache 2.0 license.  There are some current 
 limitations, such as a lack of authentication (kerberos), which appears to be 
 coming in 0.22.0, however, the driver supports username/password, but I just 
 can't use them yet.
 I will be attaching the code for the driver once the case is created.  The 
 project does not modify existing hadoop/hdfs source.
 Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1213) Implement a VFS Driver for HDFS

2010-06-16 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879437#action_12879437
 ] 

Allen Wittenauer commented on HDFS-1213:


Do you mean VFS as in the Linux virtual file system kernel API or some other 
VFS?

 Implement a VFS Driver for HDFS
 ---

 Key: HDFS-1213
 URL: https://issues.apache.org/jira/browse/HDFS-1213
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Michael D'Amour
 Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, 
 pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar


 We have an open source ETL tool (Kettle) which uses VFS for many input/output 
 steps/jobs.  We would like to be able to read/write HDFS from Kettle using 
 VFS.  
  
 I haven't been able to find anything out there other than it would be nice.
  
 I had some time a few weeks ago to begin writing a VFS driver for HDFS and we 
 (Pentaho) would like to be able to contribute this driver.  I believe it 
 supports all the major file/folder operations and I have written unit tests 
 for all of these operations.  The code is currently checked into an open 
 Pentaho SVN repository under the Apache 2.0 license.  There are some current 
 limitations, such as a lack of authentication (kerberos), which appears to be 
 coming in 0.22.0, however, the driver supports username/password, but I just 
 can't use them yet.
 I will be attaching the code for the driver once the case is created.  The 
 project does not modify existing hadoop/hdfs source.
 Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-06-16 Thread Nicolas Spiegelberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879439#action_12879439
 ] 

Nicolas Spiegelberg commented on HDFS-101:
--

Todd, your assumption is correct.  I needed a couple of small things from the 
HDFS-793 patch (namely, getNumOfReplies) to make HDFS-101 compatible with 
HDFS-872.

 DFS write pipeline : DFSClient sometimes does not detect second datanode 
 failure 
 -

 Key: HDFS-101
 URL: https://issues.apache.org/jira/browse/HDFS-101
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append, 0.20.1
Reporter: Raghu Angadi
Assignee: Hairong Kuang
Priority: Blocker
 Fix For: 0.20.2, 0.21.0

 Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, 
 detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, 
 detectDownDN3-0.20.patch, detectDownDN3.patch, hdfs-101.tar.gz, 
 HDFS-101_20-append.patch, pipelineHeartbeat.patch, 
 pipelineHeartbeat_yahoo.patch


 When the first datanode's write to second datanode fails or times out 
 DFSClient ends up marking first datanode as the bad one and removes it from 
 the pipeline. Similar problem exists on DataNode as well and it is fixed in 
 HADOOP-3339. From HADOOP-3339 : 
 The main issue is that BlockReceiver thread (and DataStreamer in the case of 
 DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
 coarse control. We don't know what state the responder is in and interrupting 
 has different effects depending on responder state. To fix this properly we 
 need to redesign how we handle these interactions.
 When the first datanode closes its socket from DFSClient, DFSClient should 
 properly read all the data left in the socket.. Also, DataNode's closing of 
 the socket should not result in a TCP reset, otherwise I think DFSClient will 
 not be able to read from the socket.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1213) Implement a VFS Driver for HDFS

2010-06-16 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879447#action_12879447
 ] 

Arun C Murthy commented on HDFS-1213:
-

Michael, could you please upload this as a patch rather than a tarball?

 Implement a VFS Driver for HDFS
 ---

 Key: HDFS-1213
 URL: https://issues.apache.org/jira/browse/HDFS-1213
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Michael D'Amour
 Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, 
 pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar


 We have an open source ETL tool (Kettle) which uses VFS for many input/output 
 steps/jobs.  We would like to be able to read/write HDFS from Kettle using 
 VFS.  
  
 I haven't been able to find anything out there other than it would be nice.
  
 I had some time a few weeks ago to begin writing a VFS driver for HDFS and we 
 (Pentaho) would like to be able to contribute this driver.  I believe it 
 supports all the major file/folder operations and I have written unit tests 
 for all of these operations.  The code is currently checked into an open 
 Pentaho SVN repository under the Apache 2.0 license.  There are some current 
 limitations, such as a lack of authentication (kerberos), which appears to be 
 coming in 0.22.0, however, the driver supports username/password, but I just 
 can't use them yet.
 I will be attaching the code for the driver once the case is created.  The 
 project does not modify existing hadoop/hdfs source.
 Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1213) Implement a VFS Driver for HDFS

2010-06-16 Thread Michael D'Amour (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879448#action_12879448
 ] 

Michael D'Amour commented on HDFS-1213:
---

Allen-  Sorry for any confusion, I am referring to Apache VFS 
(http://commons.apache.org/vfs/)

 Implement a VFS Driver for HDFS
 ---

 Key: HDFS-1213
 URL: https://issues.apache.org/jira/browse/HDFS-1213
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Michael D'Amour
 Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, 
 pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar


 We have an open source ETL tool (Kettle) which uses VFS for many input/output 
 steps/jobs.  We would like to be able to read/write HDFS from Kettle using 
 VFS.  
  
 I haven't been able to find anything out there other than it would be nice.
  
 I had some time a few weeks ago to begin writing a VFS driver for HDFS and we 
 (Pentaho) would like to be able to contribute this driver.  I believe it 
 supports all the major file/folder operations and I have written unit tests 
 for all of these operations.  The code is currently checked into an open 
 Pentaho SVN repository under the Apache 2.0 license.  There are some current 
 limitations, such as a lack of authentication (kerberos), which appears to be 
 coming in 0.22.0, however, the driver supports username/password, but I just 
 can't use them yet.
 I will be attaching the code for the driver once the case is created.  The 
 project does not modify existing hadoop/hdfs source.
 Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-101:
-

Attachment: hdfs-101-branch-0.20-append-cdh3.txt

Hey Nicolas,

I just compared our two patches side by side. The one I've been testing with is 
attached; it made a noticeable improvement in having recovery detect the correct 
down node during cluster failure testing. Here are a few differences I noticed 
(though maybe they're because the diffs are against different trees):

- Looks like your patch doesn't maintain wire compat when mirrorError is true, 
since it constructs a replies list with only 2 elements (not based on the 
number of downstream nodes)
- When receiving packets in BlockReceiver, I am explicitly forwarding 
HEART_BEAT packets where it looks like you're not checking for them. Have you 
verified by leaving a connection open with no data flowing that heartbeats are 
handled properly in BlockReceiver?
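
(For readers following along: the wire-compatibility point above is that the ack 
sent upstream should carry one status per downstream datanode even when 
mirrorError is set, rather than a fixed two-element reply. A hypothetical sketch 
of that idea, with made-up names and not the actual BlockReceiver code:)

{code}
public class AckSketch {
  static final short OP_STATUS_SUCCESS = 0;
  static final short OP_STATUS_ERROR = 1;

  // Build the reply array sent back upstream: slot 0 is this datanode,
  // followed by one slot per downstream node in the pipeline.
  static short[] buildReplies(boolean mirrorError, short[] downstreamAcks, int numDownstream) {
    short[] replies = new short[1 + numDownstream];
    replies[0] = mirrorError ? OP_STATUS_ERROR : OP_STATUS_SUCCESS;
    for (int i = 0; i < numDownstream; i++) {
      // If the mirror connection failed there are no real downstream acks,
      // but one status per node is still sent so the wire format is preserved.
      replies[i + 1] = mirrorError ? OP_STATUS_ERROR : downstreamAcks[i];
    }
    return replies;
  }
}
{code}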

 DFS write pipeline : DFSClient sometimes does not detect second datanode 
 failure 
 -

 Key: HDFS-101
 URL: https://issues.apache.org/jira/browse/HDFS-101
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append, 0.20.1
Reporter: Raghu Angadi
Assignee: Hairong Kuang
Priority: Blocker
 Fix For: 0.20.2, 0.21.0

 Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, 
 detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, 
 detectDownDN3-0.20.patch, detectDownDN3.patch, 
 hdfs-101-branch-0.20-append-cdh3.txt, hdfs-101.tar.gz, 
 HDFS-101_20-append.patch, pipelineHeartbeat.patch, 
 pipelineHeartbeat_yahoo.patch


 When the first datanode's write to second datanode fails or times out 
 DFSClient ends up marking first datanode as the bad one and removes it from 
 the pipeline. Similar problem exists on DataNode as well and it is fixed in 
 HADOOP-3339. From HADOOP-3339 : 
 The main issue is that BlockReceiver thread (and DataStreamer in the case of 
 DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
 coarse control. We don't know what state the responder is in and interrupting 
 has different effects depending on responder state. To fix this properly we 
 need to redesign how we handle these interactions.
 When the first datanode closes its socket from DFSClient, DFSClient should 
 properly read all the data left in the socket.. Also, DataNode's closing of 
 the socket should not result in a TCP reset, otherwise I think DFSClient will 
 not be able to read from the socket.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1213) Implement an Apache Commons VFS Driver for HDFS

2010-06-16 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated HDFS-1213:
---

Summary: Implement an Apache Commons VFS Driver for HDFS  (was: Implement a 
VFS Driver for HDFS)

 Implement an Apache Commons VFS Driver for HDFS
 ---

 Key: HDFS-1213
 URL: https://issues.apache.org/jira/browse/HDFS-1213
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Michael D'Amour
 Attachments: pentaho-hdfs-vfs-TRUNK-SNAPSHOT-sources.tar.gz, 
 pentaho-hdfs-vfs-TRUNK-SNAPSHOT.jar


 We have an open source ETL tool (Kettle) which uses VFS for many input/output 
 steps/jobs.  We would like to be able to read/write HDFS from Kettle using 
 VFS.  
  
 I haven't been able to find anything out there other than it would be nice.
  
 I had some time a few weeks ago to begin writing a VFS driver for HDFS and we 
 (Pentaho) would like to be able to contribute this driver.  I believe it 
 supports all the major file/folder operations and I have written unit tests 
 for all of these operations.  The code is currently checked into an open 
 Pentaho SVN repository under the Apache 2.0 license.  There are some current 
 limitations, such as a lack of authentication (kerberos), which appears to be 
 coming in 0.22.0, however, the driver supports username/password, but I just 
 can't use them yet.
 I will be attaching the code for the driver once the case is created.  The 
 project does not modify existing hadoop/hdfs source.
 Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-16 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879454#action_12879454
 ] 

Hairong Kuang commented on HDFS-1057:
-

Sam, the patch is in good shape. Thanks for working on this. A few minor 
comments:
1. ReplicaBeingWritten.java: dataLength and bytesOnDisk are the same, right? We 
do not need to introduce another field dataLength. I am also hesitant to declare 
dataLength and lastChecksum as volatile. Accesses to them are already 
synchronized, and the normal case is writing without reading.
2. We probably should remove setBytesOnDisk from ReplicaInPipelineInterface and 
ReplicaInPipeline.

 In 0.20, I made it so that client just treats this as a 0-length file. one of 
 our internal tools saw this rather frequently in 0.20.
Good to know. Then could you please handle this case the same way in trunk as 
well? Thanks again, Sam.

 Concurrent readers hit ChecksumExceptions if following a writer to very end 
 of file
 ---

 Key: HDFS-1057
 URL: https://issues.apache.org/jira/browse/HDFS-1057
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: sam rash
Priority: Blocker
 Attachments: conurrent-reader-patch-1.txt, 
 conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
 hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt


 In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
 calling flush(). Therefore, if there is a concurrent reader, it's possible to 
 race here - the reader will see the new length while those bytes are still in 
 the buffers of BlockReceiver. Thus the client will potentially see checksum 
 errors or EOFs. Additionally, the last checksum chunk of the file is made 
 accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-16 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879460#action_12879460
 ] 

sam rash commented on HDFS-1057:


1. They aren't guaranteed to be, since there are methods that change 
bytesOnDisk separately from the lastChecksum bytes.  It's entirely conceivable 
that something could update the bytes on disk w/o updating lastChecksum with 
the current set of methods.

If we are ok with a loosely coupled guarantee, then we can use bytesOnDisk and 
be careful never to call setBytesOnDisk() for any RBW.

2. Oh, your previous comments indicated we shouldn't change either 
ReplicaInPipelineInterface or ReplicaInPipeline.  If that's not the case and we 
can do this, then my comment above doesn't hold: we use bytesOnDisk and 
guarantee it's in sync with the checksum in a single synchronized method (I 
like this).

3. Will make the update to treat missing last blocks as 0-length and reinstate 
the unit test.

Thanks for all the help on this.
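
(A minimal sketch of the single-synchronized-method idea in point 2, using a 
hypothetical class rather than the actual ReplicaInPipeline code: the on-disk 
length and the last checksum are updated together, so a concurrent reader never 
observes a length without its matching checksum.)

{code}
class ReplicaSketch {
  private long bytesOnDisk;
  private byte[] lastChecksum;

  // The writer (e.g. BlockReceiver) calls this after flushing data and checksum to disk.
  synchronized void setLastChecksumAndDataLen(long dataLength, byte[] checksum) {
    this.bytesOnDisk = dataLength;
    this.lastChecksum = (checksum == null) ? null : checksum.clone();
  }

  // Readers always see a consistent (length, checksum) pair.
  synchronized long getBytesOnDisk() {
    return bytesOnDisk;
  }

  synchronized byte[] getLastChecksum() {
    return (lastChecksum == null) ? null : lastChecksum.clone();
  }
}
{code}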

 Concurrent readers hit ChecksumExceptions if following a writer to very end 
 of file
 ---

 Key: HDFS-1057
 URL: https://issues.apache.org/jira/browse/HDFS-1057
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: sam rash
Priority: Blocker
 Attachments: conurrent-reader-patch-1.txt, 
 conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
 hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt


 In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
 calling flush(). Therefore, if there is a concurrent reader, it's possible to 
 race here - the reader will see the new length while those bytes are still in 
 the buffers of BlockReceiver. Thus the client will potentially see checksum 
 errors or EOFs. Additionally, the last checksum chunk of the file is made 
 accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1214) hdfs client metadata cache

2010-06-16 Thread Joydeep Sen Sarma (JIRA)
hdfs client metadata cache
--

 Key: HDFS-1214
 URL: https://issues.apache.org/jira/browse/HDFS-1214
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Joydeep Sen Sarma


In some applications, latency is affected by the cost of making RPC calls to 
the namenode to fetch metadata. The most obvious case is calls to fetch 
file/directory status. Applications like Hive like to make optimizations based 
on file size/number etc., and for such optimizations 'recent' status data 
(as opposed to the most up-to-date) is acceptable. In such cases, a cache on 
the DFS client that transparently caches metadata would greatly benefit 
applications.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1212) Harmonize HDFS JAR library versions with Common

2010-06-16 Thread Tom White (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom White updated HDFS-1212:


Attachment: HDFS-1212.patch

 Harmonize HDFS JAR library versions with Common
 ---

 Key: HDFS-1212
 URL: https://issues.apache.org/jira/browse/HDFS-1212
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: build
Reporter: Tom White
Assignee: Tom White
Priority: Blocker
 Fix For: 0.21.0

 Attachments: HDFS-1212.patch


 HDFS part of HADOOP-6800.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1214) hdfs client metadata cache

2010-06-16 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879466#action_12879466
 ] 

Joydeep Sen Sarma commented on HDFS-1214:
-

While a cache can be maintained on the application side, it's harder and seems 
like the wrong place to implement it. In the case of a query compiler, 
different compilation stages may be fetching metadata to figure out the cost of 
a query. Furthermore, different queries may be compiled from the same JVM and 
end up requesting metadata for the same objects.

The application can identify calls that can deal with out-of-date metadata (so 
a separate API, an overlaid filesystem driver, or additional flags in the 
current API are all acceptable).

Ideally the cache should be write-through (it's very common for a single JVM to 
be reading/writing the same FS object repeatedly).
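
(As a rough illustration of what such a client-side cache could look like, with 
hypothetical names and not an actual HDFS API: recent status results are kept 
per path with a time-to-live and written through on every fetch or mutation.)

{code}
import java.util.concurrent.ConcurrentHashMap;

class MetadataCache<S> {
  private static final class Entry<S> {
    final S status;
    final long fetchedAt;
    Entry(S status, long fetchedAt) { this.status = status; this.fetchedAt = fetchedAt; }
  }

  private final ConcurrentHashMap<String, Entry<S>> cache = new ConcurrentHashMap<String, Entry<S>>();
  private final long ttlMillis;

  MetadataCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

  // Returns a cached status if it is recent enough, otherwise null (caller goes to the namenode).
  S getRecent(String path) {
    Entry<S> e = cache.get(path);
    if (e != null && System.currentTimeMillis() - e.fetchedAt < ttlMillis) {
      return e.status;
    }
    return null;
  }

  // Write-through: called whenever the client fetches fresh status or modifies the path.
  void put(String path, S status) {
    cache.put(path, new Entry<S>(status, System.currentTimeMillis()));
  }

  void invalidate(String path) {
    cache.remove(path);
  }
}
{code}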

 hdfs client metadata cache
 --

 Key: HDFS-1214
 URL: https://issues.apache.org/jira/browse/HDFS-1214
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Joydeep Sen Sarma

 In some applications, latency is affected by the cost of making RPC calls to 
 the namenode to fetch metadata. The most obvious case is calls to fetch 
 file/directory status. Applications like Hive like to make optimizations 
 based on file size/number etc., and for such optimizations 'recent' status 
 data (as opposed to the most up-to-date) is acceptable. In such cases, a cache 
 on the DFS client that transparently caches metadata would greatly benefit 
 applications.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-16 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879479#action_12879479
 ] 

Hairong Kuang commented on HDFS-1057:
-

 they aren't guaranteed to be since there are methods to change the 
 bytesOnDisk separate from the lastCheckSum bytes.
I do not see any place that updates bytesOnDisk except for BlockReceiver. 
That's why I suggested removing setBytesOnDisk from ReplicaInPipeline etc.

 Concurrent readers hit ChecksumExceptions if following a writer to very end 
 of file
 ---

 Key: HDFS-1057
 URL: https://issues.apache.org/jira/browse/HDFS-1057
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: sam rash
Priority: Blocker
 Attachments: conurrent-reader-patch-1.txt, 
 conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
 hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt


 In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
 calling flush(). Therefore, if there is a concurrent reader, it's possible to 
 race here - the reader will see the new length while those bytes are still in 
 the buffers of BlockReceiver. Thus the client will potentially see checksum 
 errors or EOFs. Additionally, the last checksum chunk of the file is made 
 accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-101:
--

Fix Version/s: 0.20-append

 DFS write pipeline : DFSClient sometimes does not detect second datanode 
 failure 
 -

 Key: HDFS-101
 URL: https://issues.apache.org/jira/browse/HDFS-101
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append, 0.20.1
Reporter: Raghu Angadi
Assignee: Hairong Kuang
Priority: Blocker
 Fix For: 0.20-append, 0.20.2, 0.21.0

 Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, 
 detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, 
 detectDownDN3-0.20.patch, detectDownDN3.patch, 
 hdfs-101-branch-0.20-append-cdh3.txt, hdfs-101.tar.gz, 
 HDFS-101_20-append.patch, pipelineHeartbeat.patch, 
 pipelineHeartbeat_yahoo.patch


 When the first datanode's write to second datanode fails or times out 
 DFSClient ends up marking first datanode as the bad one and removes it from 
 the pipeline. Similar problem exists on DataNode as well and it is fixed in 
 HADOOP-3339. From HADOOP-3339 : 
 The main issue is that BlockReceiver thread (and DataStreamer in the case of 
 DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
 coarse control. We don't know what state the responder is in and interrupting 
 has different effects depending on responder state. To fix this properly we 
 need to redesign how we handle these interactions.
 When the first datanode closes its socket from DFSClient, DFSClient should 
 properly read all the data left in the socket.. Also, DataNode's closing of 
 the socket should not result in a TCP reset, otherwise I think DFSClient will 
 not be able to read from the socket.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-445) pread() fails when cached block locations are no longer valid

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-445:
--

Fix Version/s: 0.20-append

 pread() fails when cached block locations are no longer valid
 -

 Key: HDFS-445
 URL: https://issues.apache.org/jira/browse/HDFS-445
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Kan Zhang
Assignee: Kan Zhang
 Fix For: 0.20-append, 0.21.0

 Attachments: 445-06.patch, 445-08.patch, hdfs-445-0.20-append.txt, 
 HDFS-445-0_20.2.patch


 When cached block locations are no longer valid (e.g., datanodes restart on 
 different ports), pread() will fail, whereas normal read() still succeeds by 
 re-fetching block locations from the namenode (up to a max number of times).
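
(Schematically, the retry behavior described above looks like the following 
hypothetical sketch; the names are made up and do not correspond to real 
DFSClient methods. pread() would need the same refresh-and-retry loop that the 
normal read path already has.)

{code}
import java.io.IOException;

abstract class LocationRetryingReader {
  abstract String[] cachedLocations(long offset);                   // may be stale
  abstract String[] fetchLocationsFromNameNode(long offset) throws IOException;
  abstract byte[] readFrom(String[] locations, long offset, int len) throws IOException;

  byte[] read(long offset, int len) throws IOException {
    final int maxAttempts = 3;                                      // bounded number of re-fetches
    IOException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      String[] locs = (attempt == 0) ? cachedLocations(offset)
                                     : fetchLocationsFromNameNode(offset);
      try {
        return readFrom(locs, offset, len);
      } catch (IOException e) {
        last = e;                                                   // stale location; refresh and retry
      }
    }
    throw last;
  }
}
{code}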

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-457) better handling of volume failure in Data Node storage

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-457:
-

Affects Version/s: (was: 0.20-append)

Removing 0.20-append tag - this isn't append-specific.

 better handling of volume failure in Data Node storage
 --

 Key: HDFS-457
 URL: https://issues.apache.org/jira/browse/HDFS-457
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node
Reporter: Boris Shkolnik
Assignee: Boris Shkolnik
 Fix For: 0.21.0

 Attachments: HDFS-457-1.patch, HDFS-457-2.patch, HDFS-457-2.patch, 
 HDFS-457-2.patch, HDFS-457-3.patch, HDFS-457.patch, HDFS-457_20-append.patch, 
 jira.HDFS-457.branch-0.20-internal.patch, TestFsck.zip


 Current implementation shuts DataNode down completely when one of the 
 configured volumes of the storage fails.
 This is rather wasteful behavior because it  decreases utilization (good 
 storage becomes unavailable) and imposes extra load on the system 
 (replication of the blocks from the good volumes). These problems will become 
 even more prominent when we move to mixed (heterogeneous) clusters with many 
 more volumes per Data Node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1215) TestNodeCount infinite loops on branch-20-append

2010-06-16 Thread Todd Lipcon (JIRA)
TestNodeCount infinite loops on branch-20-append


 Key: HDFS-1215
 URL: https://issues.apache.org/jira/browse/HDFS-1215
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
 Fix For: 0.20-append


HDFS-409 made some minicluster changes, which got incorporated into one of the 
earlier 20-append patches. This breaks TestNodeCount so it infinite loops on 
the branch. This patch fixes it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1215) TestNodeCount infinite loops on branch-20-append

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1215:
--

Attachment: 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch

Here's a -p1 patch that fixes this issue.

 TestNodeCount infinite loops on branch-20-append
 

 Key: HDFS-1215
 URL: https://issues.apache.org/jira/browse/HDFS-1215
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
 Fix For: 0.20-append

 Attachments: 
 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch


 HDFS-409 made some minicluster changes, which got incorporated into one of 
 the earlier 20-append patches. This breaks TestNodeCount so it infinite loops 
 on the branch. This patch fixes it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-988) saveNamespace can corrupt edits log

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-988:
-

Fix Version/s: 0.20-append

Marking this as fixed for the append branch (it's committed there, but not 
resolved for trunk yet)

 saveNamespace can corrupt edits log
 ---

 Key: HDFS-988
 URL: https://issues.apache.org/jira/browse/HDFS-988
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: dhruba borthakur
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.20-append

 Attachments: hdfs-988.txt, saveNamespace.txt, 
 saveNamespace_20-append.patch


 The administrator puts the namenode in safemode and then issues the 
 savenamespace command. This can corrupt the edits log. The problem is that 
 when the NN enters safemode, there could still be pending logSyncs occurring 
 from other threads. Now, the saveNamespace command, when executed, would save 
 an edits log with partial writes. I have seen this happen on 0.20.
 https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12828853&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12828853

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1215) TestNodeCount infinite loops on branch-20-append

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1215.
---

  Assignee: Todd Lipcon
Resolution: Fixed

Dhruba committed to 20-append branch

 TestNodeCount infinite loops on branch-20-append
 

 Key: HDFS-1215
 URL: https://issues.apache.org/jira/browse/HDFS-1215
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: 
 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch


 HDFS-409 made some minicluster changes, which got incorporated into one of 
 the earlier 20-append patches. This breaks TestNodeCount so it infinite loops 
 on the branch. This patch fixes it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread Todd Lipcon (JIRA)
Update to JUnit 4 in branch 20 append
-

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append


A lot of the append tests are JUnit 4 style. We should upgrade in branch - 
Junit 4 is entirely backward compatible.
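
(For reference, "JUnit 4 style" here means annotation-based tests rather than 
subclasses of junit.framework.TestCase; JUnit 4 still runs the old TestCase 
subclasses, which is why the upgrade is backward compatible. A minimal example:)

{code}
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// JUnit 4 style: a plain class with annotated test methods, no TestCase subclass.
public class TestJUnit4Style {
  @Test
  public void testArithmetic() {
    assertEquals(4, 2 + 2);
  }
}
{code}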

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879513#action_12879513
 ] 

dhruba borthakur commented on HDFS-1216:


+1

 Update to JUnit 4 in branch 20 append
 -

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: junit-4.5.txt


 A lot of the append tests are JUnit 4 style. We should upgrade in branch - 
 Junit 4 is entirely backward compatible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1216:
--

Attachment: junit-4.5.txt

Update to junit 4.5 (it's not the newest, but it's what we use in trunk, so we 
should be consistent)

 Update to JUnit 4 in branch 20 append
 -

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: junit-4.5.txt


 A lot of the append tests are JUnit 4 style. We should upgrade in branch - 
 Junit 4 is entirely backward compatible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1216:
--

Attachment: junit-4.5.txt

Ah, uploaded wrong file. Take 2.

 Update to JUnit 4 in branch 20 append
 -

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: junit-4.5.txt


 A lot of the append tests are JUnit 4 style. We should upgrade in branch - 
 Junit 4 is entirely backward compatible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1216:
--

Attachment: (was: junit-4.5.txt)

 Update to JUnit 4 in branch 20 append
 -

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: junit-4.5.txt


 A lot of the append tests are JUnit 4 style. We should upgrade in branch - 
 Junit 4 is entirely backward compatible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1216.


Resolution: Fixed

I just committed this. Thanks Todd!

 Update to JUnit 4 in branch 20 append
 -

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: junit-4.5.txt


 A lot of the append tests are JUnit 4 style. We should upgrade in branch - 
 Junit 4 is entirely backward compatible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1054) Remove unnecessary sleep after failure in nextBlockOutputStream

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1054:
--

Fix Version/s: 0.20-append

 Remove unnecessary sleep after failure in nextBlockOutputStream
 ---

 Key: HDFS-1054
 URL: https://issues.apache.org/jira/browse/HDFS-1054
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs client
Affects Versions: 0.20-append, 0.20.3, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append, 0.21.0

 Attachments: hdfs-1054-0.20-append.txt, hdfs-1054.txt, hdfs-1054.txt


 If DFSOutputStream fails to create a pipeline, it currently sleeps 6 seconds 
 before retrying. I don't see a great reason to wait at all, much less 6 
 seconds (especially now that HDFS-630 ensures that a retry won't go back to 
 the bad node). We should at least make it configurable, and perhaps something 
 like backoff makes some sense.
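
(For the record, "something like backoff" would look roughly like the following 
hypothetical sketch, not the actual DFSOutputStream code: a configurable base 
delay that grows per attempt instead of a fixed 6-second sleep, with no wait at 
all before the first retry.)

{code}
class PipelineRetrySketch {
  private final long baseDelayMs;   // would come from a configuration key, not a hard-coded value
  private final int maxRetries;

  PipelineRetrySketch(long baseDelayMs, int maxRetries) {
    this.baseDelayMs = baseDelayMs;
    this.maxRetries = maxRetries;
  }

  boolean createPipelineWithBackoff() throws InterruptedException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      if (tryCreatePipeline()) {
        return true;
      }
      // Exponential backoff: 0, base, 2*base, 4*base, ... instead of a flat 6s sleep.
      long delay = (attempt == 0) ? 0 : baseDelayMs << (attempt - 1);
      Thread.sleep(delay);
    }
    return false;
  }

  boolean tryCreatePipeline() {
    return false;  // placeholder for the real nextBlockOutputStream() attempt
  }
}
{code}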

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879524#action_12879524
 ] 

Tom White commented on HDFS-1216:
-

HADOOP-6800 will upgrade to JUnit 4.8.1, so perhaps you'd like to use that.

 Update to JUnit 4 in branch 20 append
 -

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: junit-4.5.txt


 A lot of the append tests are JUnit 4 style. We should upgrade in branch - 
 Junit 4 is entirely backward compatible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization

2010-06-16 Thread Todd Lipcon (JIRA)
20 append: Blocks recovered on startup should be treated with lower priority 
during block synchronization
-

 Key: HDFS-1218
 URL: https://issues.apache.org/jira/browse/HDFS-1218
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.20-append


When a datanode experiences power loss, it can come back up with truncated 
replicas (due to local FS journal replay). Those replicas should not be allowed 
to truncate the block during block synchronization if there are other replicas 
from DNs that have _not_ restarted.
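
(Illustratively, the policy described above amounts to something like the 
following hypothetical sketch, not the actual block synchronization code: 
replicas recovered on datanode startup only decide the synchronized length when 
no replica from a continuously running datanode exists.)

{code}
import java.util.ArrayList;
import java.util.List;

class SyncTargetSketch {
  static class Replica {
    final long length;
    final boolean recoveredOnStartup;  // possibly truncated by local FS journal replay
    Replica(long length, boolean recoveredOnStartup) {
      this.length = length;
      this.recoveredOnStartup = recoveredOnStartup;
    }
  }

  // Pick the replicas allowed to decide the synchronized block length.
  static List<Replica> chooseCandidates(List<Replica> all) {
    List<Replica> preferred = new ArrayList<Replica>();
    for (Replica r : all) {
      if (!r.recoveredOnStartup) {
        preferred.add(r);
      }
    }
    // Only when every replica came from a restarted datanode are those considered.
    return preferred.isEmpty() ? all : preferred;
  }
}
{code}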

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1141) completeFile does not check lease ownership

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-1141:
---

Fix Version/s: 0.20-append

 completeFile does not check lease ownership
 ---

 Key: HDFS-1141
 URL: https://issues.apache.org/jira/browse/HDFS-1141
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.20-append, 0.22.0

 Attachments: hdfs-1141-branch20.txt, hdfs-1141.txt, hdfs-1141.txt


 completeFile should check that the caller still owns the lease of the file 
 that it's completing. This is for the 'testCompleteOtherLeaseHoldersFile' 
 case in HDFS-1139.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1218:
--

Attachment: hdfs-1281.txt

Here's a patch, but it won't apply on top of the branch currently. It requires 
HDFS-1057 and possibly some other FSDataset patches first to apply without 
conflict (possibly HDFS-1056).

 20 append: Blocks recovered on startup should be treated with lower priority 
 during block synchronization
 -

 Key: HDFS-1218
 URL: https://issues.apache.org/jira/browse/HDFS-1218
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.20-append

 Attachments: hdfs-1281.txt


 When a datanode experiences power loss, it can come back up with truncated 
 replicas (due to local FS journal replay). Those replicas should not be 
 allowed to truncate the block during block synchronization if there are other 
 replicas from DNs that have _not_ restarted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1194) Secondary namenode fails to fetch the image from the primary

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1194:
--

Affects Version/s: (was: 0.20-append)

Removing append tag, since it's unrelated.

 Secondary namenode fails to fetch the image from the primary
 

 Key: HDFS-1194
 URL: https://issues.apache.org/jira/browse/HDFS-1194
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
 Environment: Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
 Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)
 CentOS 5
Reporter: Dmytro Molkov
Assignee: Dmytro Molkov

 We just hit the problem described in HDFS-1024 again.
 After more investigation of the underlying problems with 
 CancelledKeyException there are some findings:
 One of the symptoms: the transfer becomes really slow (it does 700 kb/s) when 
 I am doing the fetch using wget. At the same time disk and network are OK 
 since I can copy at 50 mb/s using scp.
 I was taking jstacks of the namenode while the transfer was in progress, and we 
 found that every stack trace has one jetty thread sitting in this place:
 {code}
java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at 
 org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:452)
   at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
   at 
 org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
   at 
 org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
   at 
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
 {code}
 Here is a jetty code that corresponds to this:
 {code}
 // Look for JVM bug
 if (selected==0 && wait>0 && (now-before)<wait/2 && _selector.selectedKeys().size()==0)
 {
     if (_jvmBug++>5)  // TODO tune or configure this
     {
         // Probably JVM BUG!

         Iterator iter = _selector.keys().iterator();
         while(iter.hasNext())
         {
             key = (SelectionKey) iter.next();
             if (key.isValid() && key.interestOps()==0)
             {
                 key.cancel();
             }
         }
         try
         {
             Thread.sleep(20);  // tune or configure this
         }
         catch (InterruptedException e)
         {
             Log.ignore(e);
         }
     }
 }
 {code}
 Based on this it is obvious we are hitting a jetty workaround for a JVM bug 
 that doesn't handle select() properly.
 There is a jetty JIRA for this http://jira.codehaus.org/browse/JETTY-937 (it 
 actually introduces the workaround for the JVM bug that we are hitting)
 They say that the problem was fixed in 6.1.22; there is also a person on that 
 JIRA saying that switching to SocketConnector instead of SelectChannelConnector 
 helped in their case.
 Since we are hitting the same bug in our world, one option is to adopt the 
 newer Jetty version where there is a better workaround; it might not help if we 
 are still hitting that bug constantly, but the workaround might be better.
 Another approach is to switch to using SocketConnector which will eliminate 
 the problem completely, although I am not sure what problems that will bring.
 The Java version we are running is given in the Environment field.
 Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1207) 0.20-append: stallReplicationWork should be volatile

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1207:
--

Attachment: hdfs-1207.txt

 0.20-append: stallReplicationWork should be volatile
 

 Key: HDFS-1207
 URL: https://issues.apache.org/jira/browse/HDFS-1207
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Attachments: hdfs-1207.txt


 The stallReplicationWork member in FSNamesystem is accessed by multiple 
 threads without synchronization, but isn't marked volatile. I believe this is 
 responsible for about a 1% failure rate on 
 TestFileAppend4.testAppendSyncChecksum* on my 8-core test boxes (looking at 
 the logs I see replication happening even though we've supposedly disabled it).
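
(The fix is the standard visibility idiom for a flag read by another thread; a 
simplified sketch, not the actual FSNamesystem code:)

{code}
class ReplicationMonitorSketch {
  // Without 'volatile', the replication monitor thread may keep seeing a stale
  // cached value of this flag and continue scheduling replications.
  private volatile boolean stallReplicationWork = false;

  // Called from the test / admin thread.
  void setStallReplicationWork(boolean stall) {
    stallReplicationWork = stall;
  }

  // Called periodically by the replication monitor thread.
  void computeReplicationWork() {
    if (stallReplicationWork) {
      return;  // replication is paused
    }
    // ... choose under-replicated blocks and schedule work ...
  }
}
{code}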

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-142) In 0.20, move blocks being written into a blocksBeingWritten directory

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-142.
---

Resolution: Fixed

I have committed this. Thanks Sam, Nicolas and Todd.

 In 0.20, move blocks being written into a blocksBeingWritten directory
 --

 Key: HDFS-142
 URL: https://issues.apache.org/jira/browse/HDFS-142
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Raghu Angadi
Assignee: dhruba borthakur
Priority: Blocker
 Fix For: 0.20-append

 Attachments: appendFile-recheck-lease.txt, appendQuestions.txt, 
 deleteTmp.patch, deleteTmp2.patch, deleteTmp5_20.txt, deleteTmp5_20.txt, 
 deleteTmp_0.18.patch, dont-recover-rwr-when-rbw-available.txt, 
 handleTmp1.patch, hdfs-142-commitBlockSynchronization-unknown-datanode.txt, 
 HDFS-142-deaddn-fix.patch, HDFS-142-finalize-fix.txt, 
 hdfs-142-minidfs-fix-from-409.txt, 
 HDFS-142-multiple-blocks-datanode-exception.patch, 
 hdfs-142-recovery-reassignment-and-bbw-cleanup.txt, hdfs-142-testcases.txt, 
 hdfs-142-testleaserecovery-fix.txt, HDFS-142_20-append2.patch, 
 HDFS-142_20.patch, recentInvalidateSets-assertion-fix.txt, 
 recover-rbw-v2.txt, testfileappend4-deaddn.txt, 
 validateBlockMetaData-synchronized.txt


 Before 0.18, when Datanode restarts, it deletes files under data-dir/tmp  
 directory since these files are not valid anymore. But in 0.18 it moves these 
 files to normal directory incorrectly making them valid blocks. One of the 
 following would work :
 - remove the tmp files during upgrade, or
 - if the files under /tmp are in pre-18 format (i.e. no generation), delete 
 them.
 Currently effect of this bug is that, these files end up failing block 
 verification and eventually get deleted. But cause incorrect over-replication 
 at the namenode before that.
 Also it looks like our policy regarding treating files under tmp needs to be 
 defined better. Right now there are probably one or two more bugs with it. 
 Dhruba, please file them if you remember.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1141) completeFile does not check lease ownership

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1141.


Resolution: Fixed

Pulled into hadoop-0.20-append

 completeFile does not check lease ownership
 ---

 Key: HDFS-1141
 URL: https://issues.apache.org/jira/browse/HDFS-1141
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.20-append, 0.22.0

 Attachments: hdfs-1141-branch20.txt, hdfs-1141.txt, hdfs-1141.txt


 completeFile should check that the caller still owns the lease of the file 
 that it's completing. This is for the 'testCompleteOtherLeaseHoldersFile' 
 case in HDFS-1139.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879563#action_12879563
 ] 

Todd Lipcon commented on HDFS-1057:
---

[for branch 0.20 append, +1 -- I've been running with this for 6 weeks, it 
works, and looks good!]

 Concurrent readers hit ChecksumExceptions if following a writer to very end 
 of file
 ---

 Key: HDFS-1057
 URL: https://issues.apache.org/jira/browse/HDFS-1057
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: sam rash
Priority: Blocker
 Attachments: conurrent-reader-patch-1.txt, 
 conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
 hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, hdfs-1057-trunk-3.txt


 In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
 calling flush(). Therefore, if there is a concurrent reader, it's possible to 
 race here - the reader will see the new length while those bytes are still in 
 the buffers of BlockReceiver. Thus the client will potentially see checksum 
 errors or EOFs. Additionally, the last checksum chunk of the file is made 
 accessible to readers even though it is not stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table

2010-06-16 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879582#action_12879582
 ] 

Suresh Srinivas commented on HDFS-1114:
---

# BlocksMap.java
#* Typo "exponient". Should be "exponent"?
#* Capacity should be divided by a reference size of 8 or 4, depending on the 
64-bit or 32-bit java version.
#* The current capacity calculation seems quite complex. Add more explanation 
on why it is implemented that way.
# LightWeightGSet.java
#* "which uses a hash table for storing the elements" - should this say "uses 
an array"?
#* Add a comment that the size of entries is a power of two.
#* Throw HadoopIllegalArgumentException instead of IllegalArgumentException 
(for the 20 version of the patch it could remain IllegalArgumentException).
#* remove() - for better readability there is no need for else if and else, 
since the previous block returns.
#* toString() - prints all the entries. This is a bad idea if someone passes 
this object to a Log unknowingly. If all the details of the hash map are 
needed, we should have some other method such as dump() or printDetails() to do 
the same.
# TestGSet.java
#* In exception tests, instead of printing a log when the expected exception 
happens, print a message in Assert.fail(), like Assert.fail("Expected exception 
was not thrown"). Checks for exceptions should be more specific, instead of 
Exception. It is also a good idea to document these exceptions in the javadoc 
for the methods in GSet.
#* println should use Log.info instead of System.out.println?
#* Add some comments to classes on what they do/how they are used.
#* Add some comments to GSetTestCase members (denominator etc.) and the 
constructor.
#* Add comments to testGSet() on what each of the cases is accomplishing.



 Reducing NameNode memory usage by an alternate hash table
 -

 Key: HDFS-1114
 URL: https://issues.apache.org/jira/browse/HDFS-1114
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Tsz Wo (Nicholas), SZE
Assignee: Tsz Wo (Nicholas), SZE
 Attachments: GSet20100525.pdf, gset20100608.pdf, 
 h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch


 NameNode uses a java.util.HashMap to store BlockInfo objects.  When there are 
 many blocks in HDFS, this map uses a lot of memory in the NameNode.  We may 
 optimize the memory usage by a light weight hash table implementation.
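
One way to picture such a light weight table is an intrusive, array-backed hash set: the array length is a power of two and each element carries its own next pointer, so no per-entry wrapper objects are allocated. The sketch below only illustrates that idea; the class and method names are hypothetical and do not mirror the real LightWeightGSet API.
{noformat}
// Sketch only: an intrusive, array-backed hash set. Elements embed their own
// "next" pointer, so the table allocates no HashMap.Entry wrappers.
class IntrusiveHashSetSketch<E extends IntrusiveHashSetSketch.Entry<E>> {

  /** Elements must carry their own next pointer. */
  interface Entry<E> {
    E getNext();
    void setNext(E next);
  }

  private final E[] buckets;
  private final int mask; // capacity is a power of two, so index = hash & mask
  private int size;

  @SuppressWarnings("unchecked")
  IntrusiveHashSetSketch(int capacityPowerOfTwo) {
    if (Integer.bitCount(capacityPowerOfTwo) != 1) {
      throw new IllegalArgumentException("capacity must be a power of two");
    }
    buckets = (E[]) new Entry[capacityPowerOfTwo];
    mask = capacityPowerOfTwo - 1;
  }

  private int indexOf(Object key) {
    return key.hashCode() & mask;
  }

  E get(E key) {
    for (E e = buckets[indexOf(key)]; e != null; e = e.getNext()) {
      if (e.equals(key)) {
        return e;
      }
    }
    return null;
  }

  void put(E element) {
    int i = indexOf(element);
    element.setNext(buckets[i]); // push onto the bucket's chain
    buckets[i] = element;
    size++;
  }

  int size() { return size; }
}
{noformat}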

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1000) libhdfs needs to be updated to use the new UGI

2010-06-16 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879584#action_12879584
 ] 

Suresh Srinivas commented on HDFS-1000:
---

+1 patch looks good.

 libhdfs needs to be updated to use the new UGI
 --

 Key: HDFS-1000
 URL: https://issues.apache.org/jira/browse/HDFS-1000
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.21.0, 0.22.0
Reporter: Devaraj Das
Assignee: Devaraj Das
Priority: Blocker
 Fix For: 0.21.0, 0.22.0

 Attachments: fs-javadoc.patch, hdfs-1000-bp20.3.patch, 
 hdfs-1000-bp20.4.patch, hdfs-1000-bp20.patch, hdfs-1000-trunk.1.patch


 libhdfs needs to be updated w.r.t the APIs in the new UGI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table

2010-06-16 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HDFS-1114:
-

Status: Open  (was: Patch Available)

Thanks for the detailed review, Suresh.

   1.  BlocksMap.java

done.

   2. LightWeightGSet.java

done all except the following.

  * remove() - for better readability ...

Isn't an implicit else better than an explicit else?

   3. TestGSet.java
  * In exception tests, ...

Now catching specific exceptions, but I did not change the messages.

  * println should use Log.info instead of System.out.println?

No, print(..) and println(..) work together.

  * add some comments to ...
  * add some comments to ...
  * add comments to ...

Added some more comments.

 Reducing NameNode memory usage by an alternate hash table
 -

 Key: HDFS-1114
 URL: https://issues.apache.org/jira/browse/HDFS-1114
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Tsz Wo (Nicholas), SZE
Assignee: Tsz Wo (Nicholas), SZE
 Attachments: GSet20100525.pdf, gset20100608.pdf, 
 h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, 
 h1114_20100616b.patch


 NameNode uses a java.util.HashMap to store BlockInfo objects.  When there are 
 many blocks in HDFS, this map uses a lot of memory in the NameNode.  We may 
 optimize the memory usage by a light weight hash table implementation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1114) Reducing NameNode memory usage by an alternate hash table

2010-06-16 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HDFS-1114:
-

Status: Patch Available  (was: Open)

Hudson does not seem to be working.  It did not pick up my previous patch for a long time.  
Re-submitting.

 Reducing NameNode memory usage by an alternate hash table
 -

 Key: HDFS-1114
 URL: https://issues.apache.org/jira/browse/HDFS-1114
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Tsz Wo (Nicholas), SZE
Assignee: Tsz Wo (Nicholas), SZE
 Attachments: GSet20100525.pdf, gset20100608.pdf, 
 h1114_20100607.patch, h1114_20100614b.patch, h1114_20100615.patch, 
 h1114_20100616b.patch


 NameNode uses a java.util.HashMap to store BlockInfo objects.  When there are 
 many blocks in HDFS, this map uses a lot of memory in the NameNode.  We may 
 optimize the memory usage by a light weight hash table implementation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1206) TestFiHFlush depends on BlocksMap implementation

2010-06-16 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879615#action_12879615
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1206:
--

Saw it failing again.
{noformat}
Testcase: hFlushFi01_a took 4.553 sec
FAILED

junit.framework.AssertionFailedError: 
at 
org.apache.hadoop.hdfs.TestFiHFlush.runDiskErrorTest(TestFiHFlush.java:56)
at 
org.apache.hadoop.hdfs.TestFiHFlush.hFlushFi01_a(TestFiHFlush.java:72)
{noformat}

 TestFiHFlush depends on BlocksMap implementation
 

 Key: HDFS-1206
 URL: https://issues.apache.org/jira/browse/HDFS-1206
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Reporter: Tsz Wo (Nicholas), SZE

 When I was testing HDFS-1114, the patch passed all tests except TestFiHFlush. 
  Then, I tried to print out some debug messages; however, TestFiHFlush 
 succeeded after I added the messages.
 TestFiHFlush probably depends on the speed of BlocksMap.  If BlocksMap is 
 slow enough, then it will pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1204) 0.20: Lease expiration should recover single files, not entire lease holder

2010-06-16 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879616#action_12879616
 ] 

dhruba borthakur commented on HDFS-1204:


Sam/Todd: can you please comment on whether this bug exists in Hadoop trunk?

 0.20: Lease expiration should recover single files, not entire lease holder
 ---

 Key: HDFS-1204
 URL: https://issues.apache.org/jira/browse/HDFS-1204
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: sam rash
 Fix For: 0.20-append

 Attachments: hdfs-1204.txt, hdfs-1204.txt


 This was brought up in HDFS-200 but didn't make it into the branch on Apache.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1209) Add conf dfs.client.block.recovery.retries to configure number of block recovery attempts

2010-06-16 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879617#action_12879617
 ] 

dhruba borthakur commented on HDFS-1209:


This one does not apply cleanly to 0.20-append. Can you please post a new patch?

 Add conf dfs.client.block.recovery.retries to configure number of block 
 recovery attempts
 -

 Key: HDFS-1209
 URL: https://issues.apache.org/jira/browse/HDFS-1209
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs client
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Attachments: hdfs-1209.txt


 This variable is referred to in the TestFileAppend4 tests, but it isn't 
 actually looked at by DFSClient (I'm betting this is in FB's branch).
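
For illustration, a client-side knob like this would typically be read once from the configuration with a default and then used to bound the retry loop. In the sketch below the key name comes from the issue title, while the default value of 5 and the surrounding retry helper are assumptions, not the actual patch.
{noformat}
// Sketch only: read a retry-count knob with a default and use it to bound
// block-recovery attempts.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

class BlockRecoveryRetriesSketch {
  private final int maxBlockRecoveryRetries;

  BlockRecoveryRetriesSketch(Configuration conf) {
    // dfs.client.block.recovery.retries, falling back to a default if unset
    this.maxBlockRecoveryRetries =
        conf.getInt("dfs.client.block.recovery.retries", 5);
  }

  boolean recoverWithRetries(RecoveryAttempt attempt) throws IOException {
    IOException last = null;
    for (int i = 0; i < maxBlockRecoveryRetries; i++) {
      try {
        attempt.run();
        return true;
      } catch (IOException e) {
        last = e; // remember the failure and retry up to the configured bound
      }
    }
    if (last != null) {
      throw last;
    }
    return false;
  }

  /** Hypothetical callback representing one recovery RPC attempt. */
  interface RecoveryAttempt {
    void run() throws IOException;
  }
}
{noformat}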

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1204) 0.20: Lease expiration should recover single files, not entire lease holder

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879618#action_12879618
 ] 

Todd Lipcon commented on HDFS-1204:
---

I think it does not - it looks like it was a regression caused by HDFS-200 in 
the 0.20-append branch.

 0.20: Lease expiration should recover single files, not entire lease holder
 ---

 Key: HDFS-1204
 URL: https://issues.apache.org/jira/browse/HDFS-1204
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: sam rash
 Fix For: 0.20-append

 Attachments: hdfs-1204.txt, hdfs-1204.txt


 This was brought up in HDFS-200 but didn't make it into the branch on Apache.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1209) Add conf dfs.client.block.recovery.retries to configure number of block recovery attempts

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879620#action_12879620
 ] 

Todd Lipcon commented on HDFS-1209:
---

This should apply after HDFS-1210 - can you commit that one first?

 Add conf dfs.client.block.recovery.retries to configure number of block 
 recovery attempts
 -

 Key: HDFS-1209
 URL: https://issues.apache.org/jira/browse/HDFS-1209
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs client
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Attachments: hdfs-1209.txt


 This variable is referred to in the TestFileAppend4 tests, but it isn't 
 actually looked at by DFSClient (I'm betting this is in FB's branch).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1207) 0.20-append: stallReplicationWork should be volatile

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1207.


Fix Version/s: 0.20-append
   Resolution: Fixed

I just committed this. Thanks Todd!

 0.20-append: stallReplicationWork should be volatile
 

 Key: HDFS-1207
 URL: https://issues.apache.org/jira/browse/HDFS-1207
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: hdfs-1207.txt


 the stallReplicationWork member in FSNamesystem is accessed by multiple 
 threads without synchronization, but isn't marked volatile. I believe this is 
 responsible for about 1% failure rate on 
 TestFileAppend4.testAppendSyncChecksum* on my 8-core test boxes (looking at 
 logs I see replication happening even though we've supposedly disabled it)
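
A minimal sketch of the visibility problem: a flag written by one thread and polled by another must be declared volatile (or read under a lock), otherwise the polling thread may never observe the update. The class below is a stand-in, not FSNamesystem itself.
{noformat}
// Sketch only: without volatile, the worker thread may cache the old value
// of the flag forever and never see the test thread's write.
class ReplicationStallSketch {
  private volatile boolean stallReplicationWork = false;

  void setStall(boolean stall) {       // called from a test/admin thread
    stallReplicationWork = stall;
  }

  void replicationWorkerLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      if (stallReplicationWork) {      // reliably sees the latest write
        Thread.sleep(100);
        continue;
      }
      // ... compute replication work here ...
      Thread.sleep(100);
    }
  }
}
{noformat}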

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1204) 0.20: Lease expiration should recover single files, not entire lease holder

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1204.


Resolution: Fixed

 0.20: Lease expiration should recover single files, not entire lease holder
 ---

 Key: HDFS-1204
 URL: https://issues.apache.org/jira/browse/HDFS-1204
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: sam rash
 Fix For: 0.20-append

 Attachments: hdfs-1204.txt, hdfs-1204.txt


 This was brought up in HDFS-200 but didn't make it into the branch on Apache.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1118) DFSOutputStream socket leak when cannot connect to DataNode

2010-06-16 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879626#action_12879626
 ] 

dhruba borthakur commented on HDFS-1118:


Code looks good to me. I will commit this to trunk.

 DFSOutputStream socket leak when cannot connect to DataNode
 ---

 Key: HDFS-1118
 URL: https://issues.apache.org/jira/browse/HDFS-1118
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append, 0.20.1, 0.20.2
Reporter: Zheng Shao
Assignee: Zheng Shao
 Fix For: 0.20-append

 Attachments: HDFS-1118.1.patch, HDFS-1118.2.patch


 The offending code is in {{DFSOutputStream.nextBlockOutputStream}}
 This function retries several times to call {{createBlockOutputStream}}. Each 
 time when it fails, it leaves a {{Socket}} object in {{DFSOutputStream.s}}.
 That object is never closed, but is overwritten the next time 
 {{createBlockOutputStream}} is called.
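
A minimal sketch of the leak pattern and the obvious remedy (close the socket opened by a failed attempt before retrying); the field and method names are illustrative, not the actual DFSOutputStream code.
{noformat}
// Sketch only: each failed attempt must close the socket it opened, otherwise
// the reference is overwritten on the next retry and the descriptor leaks.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class OutputStreamRetrySketch {
  private Socket s; // analogous to DFSOutputStream.s

  boolean nextBlockOutputStream(InetSocketAddress target, int retries) {
    for (int i = 0; i < retries; i++) {
      try {
        s = new Socket();
        s.connect(target, 3000);
        return true;                  // connected; caller uses 's'
      } catch (IOException e) {
        closeQuietly(s);              // the fix: don't leak the failed socket
        s = null;
      }
    }
    return false;
  }

  private static void closeQuietly(Socket sock) {
    if (sock == null) {
      return;
    }
    try {
      sock.close();
    } catch (IOException ignored) {
      // best effort during cleanup
    }
  }
}
{noformat}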

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1210) DFSClient should log exception when block recovery fails

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1210.


Fix Version/s: 0.20-append
   Resolution: Fixed

I just committed this. Thanks Todd.

 DFSClient should log exception when block recovery fails
 

 Key: HDFS-1210
 URL: https://issues.apache.org/jira/browse/HDFS-1210
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs client
Affects Versions: 0.20-append, 0.20.2
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Trivial
 Fix For: 0.20-append

 Attachments: hdfs-1210.txt


 Right now we just retry without necessarily showing the exception. It can be 
 useful to see what the error was that prevented the recovery RPC from 
 succeeding.
 (I believe this only applies in 0.20 style of block recovery)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1219) Data Loss due to edits log truncation

2010-06-16 Thread Thanh Do (JIRA)
Data Loss due to edits log truncation
-

 Key: HDFS-1219
 URL: https://issues.apache.org/jira/browse/HDFS-1219
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.2
Reporter: Thanh Do


We found this problem at almost the same time as the HDFS developers.
Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage.
Hence, any crash that happens after the truncation but before the renaming will lead
to data loss. A detailed description can be found here:
https://issues.apache.org/jira/browse/HDFS-955

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu
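
A minimal sketch of the ordering principle only: promote the checkpoint image (an atomic rename) before truncating the edits it subsumes, so a crash in between never leaves an old fsimage next to an already-truncated edits log. File names and helpers are hypothetical, and a real fix also needs restart-time logic for the case where the rename succeeded but the truncation did not.
{noformat}
// Sketch only: the data-loss-avoiding ordering is rename first, truncate second.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class CheckpointFinalizeSketch {
  static void finalizeCheckpoint(File dir) throws IOException {
    File ckpt = new File(dir, "fsimage.ckpt");
    File image = new File(dir, "fsimage");
    File edits = new File(dir, "edits");

    // 1) Promote the checkpoint image atomically.
    Files.move(ckpt.toPath(), image.toPath(),
        StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);

    // 2) Only now truncate the edits already covered by the new image.
    FileOutputStream fos = new FileOutputStream(edits); // truncates to zero
    fos.close();
  }
}
{noformat}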

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1220) Namenode unable to start due to truncated fstime

2010-06-16 Thread Thanh Do (JIRA)
Namenode unable to start due to truncated fstime


 Key: HDFS-1220
 URL: https://issues.apache.org/jira/browse/HDFS-1220
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: updating the fstime file on disk is not atomic, so it is possible that
if a crash happens in the middle, the next time the NameNode reboots it will
read a stale fstime and hence be unable to start successfully.
 
- Details:
Below is the code for updating the fstime file on disk:
  void writeCheckpointTime(StorageDirectory sd) throws IOException {
    if (checkpointTime < 0L)
      return; // do not write negative time
    File timeFile = getImageFile(sd, NameNodeFile.TIME);
    if (timeFile.exists()) { timeFile.delete(); }
    DataOutputStream out = new DataOutputStream(
                                        new FileOutputStream(timeFile));
    try {
      out.writeLong(checkpointTime);
    } finally {
      out.close();
    }
  }
 
Basically, this involves 3 steps:
1) delete the fstime file (timeFile.delete())
2) truncate the fstime file (new FileOutputStream(timeFile))
3) write the new time to the fstime file (out.writeLong(checkpointTime))
If a crash happens after step 2 and before step 3, then on the next reboot the NameNode
gets an exception when reading the time (8 bytes) from an empty fstime file.


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu
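
One commonly used way to make such an update crash-safe is to write the new value to a temporary file, force it to disk, and then atomically rename it over fstime, so a crash at any point leaves either the old or the complete new file. The sketch below illustrates that pattern only; the file names and helper class are hypothetical, not an actual HDFS patch.
{noformat}
// Sketch only: write-to-temp + fsync + atomic rename, so a crash never leaves
// an empty fstime behind.
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class CheckpointTimeWriterSketch {
  void writeCheckpointTimeAtomically(File dir, long checkpointTime)
      throws IOException {
    if (checkpointTime < 0L) {
      return; // do not write negative time
    }
    File tmp = new File(dir, "fstime.tmp");
    File timeFile = new File(dir, "fstime");
    FileOutputStream fos = new FileOutputStream(tmp);
    DataOutputStream out = new DataOutputStream(fos);
    try {
      out.writeLong(checkpointTime);
      out.flush();
      fos.getFD().sync();            // make the temp file durable first
    } finally {
      out.close();
    }
    // Atomic replace: readers see either the old or the new fstime.
    Files.move(tmp.toPath(), timeFile.toPath(),
        StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
  }
}
{noformat}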

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1220) Namenode unable to start due to truncated fstime

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1220:
---

Description: 
- Summary: updating fstime file on disk is not atomic, so it is possible that
if a crash happens in the middle, next time when NameNode reboots, it will
read stale fstime, hence unable to start successfully.
 
- Details:
Basically, this involve 3 steps:
1) delete fstime file (timeFile.delete())
2) truncate fstime file (new FileOutputStream(timeFile))
3) write new time to fstime file (out.writeLong(checkpointTime))
If a crash happens after step 2 and before step 3, in the next reboot, NameNode
got an exception when reading the time (8 byte) from an empty fstime file.


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu

  was:
- Summary: updating fstime file on disk is not atomic, so it is possible that
if a crash happens in the middle, next time when NameNode reboots, it will
read stale fstime, hence unable to start successfully.
 
- Details:
Below is the code for updating fstime file on disk
  void writeCheckpointTime(StorageDirectory sd) throws IOException {
if (checkpointTime < 0L)
  return; // do not write negative time 

 
File timeFile = getImageFile(sd, NameNodeFile.TIME);
if (timeFile.exists()) { timeFile.delete(); }
DataOutputStream out = new DataOutputStream(
new FileOutputStream(timeFile));
try {
  out.writeLong(checkpointTime);
} finally {
  out.close();
}
  }
 
Basically, this involve 3 steps:
1) delete fstime file (timeFile.delete())
2) truncate fstime file (new FileOutputStream(timeFile))
3) write new time to fstime file (out.writeLong(checkpointTime))
If a crash happens after step 2 and before step 3, in the next reboot, NameNode
got an exception when reading the time (8 byte) from an empty fstime file.


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu


 Namenode unable to start due to truncated fstime
 

 Key: HDFS-1220
 URL: https://issues.apache.org/jira/browse/HDFS-1220
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: updating fstime file on disk is not atomic, so it is possible that
 if a crash happens in the middle, next time when NameNode reboots, it will
 read stale fstime, hence unable to start successfully.
  
 - Details:
 Basically, this involve 3 steps:
 1) delete fstime file (timeFile.delete())
 2) truncate fstime file (new FileOutputStream(timeFile))
 3) write new time to fstime file (out.writeLong(checkpointTime))
 If a crash happens after step 2 and before step 3, in the next reboot, 
 NameNode
 got an exception when reading the time (8 byte) from an empty fstime file.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1220) Namenode unable to start due to truncated fstime

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1220:
---

Description: 
- Summary: updating fstime file on disk is not atomic, so it is possible that
if a crash happens in the middle, next time when NameNode reboots, it will
read stale fstime, hence unable to start successfully.
 
- Details:
Basically, this involve 3 steps:
1) delete fstime file (timeFile.delete())
2) truncate fstime file (new FileOutputStream(timeFile))
3) write new time to fstime file (out.writeLong(checkpointTime))
If a crash happens after step 2 and before step 3, in the next reboot, NameNode
got an exception when reading the time (8 byte) from an empty fstime file.


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu

  was:
- Summary: updating fstime file on disk is not atomic, so it is possible that
if a crash happens in the middle, next time when NameNode reboots, it will
read stale fstime, hence unable to start successfully.
 
- Details:
Below is the code for updating fstime file on disk
  void writeCheckpointTime(StorageDirectory sd) throws IOException {
if (checkpointTime < 0L)
  return; // do not write negative time
File timeFile = getImageFile(sd, NameNodeFile.TIME);
if (timeFile.exists()) { timeFile.delete(); }
DataOutputStream out = new DataOutputStream(
new FileOutputStream(timeFile));
try {
  out.writeLong(checkpointTime);
} finally {
  out.close();
}
  }


Basically, this involve 3 steps:
1) delete fstime file (timeFile.delete())
2) truncate fstime file (new FileOutputStream(timeFile))
3) write new time to fstime file (out.writeLong(checkpointTime))
If a crash happens after step 2 and before step 3, in the next reboot, NameNode
got an exception when reading the time (8 byte) from an empty fstime file.


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu


 Namenode unable to start due to truncated fstime
 

 Key: HDFS-1220
 URL: https://issues.apache.org/jira/browse/HDFS-1220
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: updating fstime file on disk is not atomic, so it is possible that
 if a crash happens in the middle, next time when NameNode reboots, it will
 read stale fstime, hence unable to start successfully.
  
 - Details:
 Basically, this involve 3 steps:
 1) delete fstime file (timeFile.delete())
 2) truncate fstime file (new FileOutputStream(timeFile))
 3) write new time to fstime file (out.writeLong(checkpointTime))
 If a crash happens after step 2 and before step 3, in the next reboot, 
 NameNode
 got an exception when reading the time (8 byte) from an empty fstime file.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1220) Namenode unable to start due to truncated fstime

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1220:
---

Description: 
- Summary: updating fstime file on disk is not atomic, so it is possible that
if a crash happens in the middle, next time when NameNode reboots, it will
read stale fstime, hence unable to start successfully.
 
- Details:
Below is the code for updating fstime file on disk
  void writeCheckpointTime(StorageDirectory sd) throws IOException {
if (checkpointTime < 0L)
  return; // do not write negative time
File timeFile = getImageFile(sd, NameNodeFile.TIME);
if (timeFile.exists()) { timeFile.delete(); }
DataOutputStream out = new DataOutputStream(
new FileOutputStream(timeFile));
try {
  out.writeLong(checkpointTime);
} finally {
  out.close();
}
  }


Basically, this involve 3 steps:
1) delete fstime file (timeFile.delete())
2) truncate fstime file (new FileOutputStream(timeFile))
3) write new time to fstime file (out.writeLong(checkpointTime))
If a crash happens after step 2 and before step 3, in the next reboot, NameNode
got an exception when reading the time (8 byte) from an empty fstime file.


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu

  was:
- Summary: updating fstime file on disk is not atomic, so it is possible that
if a crash happens in the middle, next time when NameNode reboots, it will
read stale fstime, hence unable to start successfully.
 
- Details:
Basically, this involve 3 steps:
1) delete fstime file (timeFile.delete())
2) truncate fstime file (new FileOutputStream(timeFile))
3) write new time to fstime file (out.writeLong(checkpointTime))
If a crash happens after step 2 and before step 3, in the next reboot, NameNode
got an exception when reading the time (8 byte) from an empty fstime file.


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu


 Namenode unable to start due to truncated fstime
 

 Key: HDFS-1220
 URL: https://issues.apache.org/jira/browse/HDFS-1220
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: updating fstime file on disk is not atomic, so it is possible that
 if a crash happens in the middle, next time when NameNode reboots, it will
 read stale fstime, hence unable to start successfully.
  
 - Details:
 Below is the code for updating fstime file on disk
   void writeCheckpointTime(StorageDirectory sd) throws IOException {
 if (checkpointTime < 0L)
   return; // do not write negative time
 File timeFile = getImageFile(sd, NameNodeFile.TIME);
 if (timeFile.exists()) { timeFile.delete(); }
 DataOutputStream out = new DataOutputStream(
 new 
 FileOutputStream(timeFile));
 try {
   out.writeLong(checkpointTime);
 } finally {
   out.close();
 }
   }
 Basically, this involve 3 steps:
 1) delete fstime file (timeFile.delete())
 2) truncate fstime file (new FileOutputStream(timeFile))
 3) write new time to fstime file (out.writeLong(checkpointTime))
 If a crash happens after step 2 and before step 3, in the next reboot, 
 NameNode
 got an exception when reading the time (8 byte) from an empty fstime file.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1221) NameNode unable to start due to stale edits log after a crash

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1221:
---

Description: 
- Summary: 
If a crash happens during FSEditLog.createEditLogFile(), the
edits log file on disk may be stale. During next reboot, NameNode 
will get an exception when parsing the edits file, because of stale data, 
leading to unsuccessful reboot.
Note: This is just one example. Since we see that edits log (and fsimage)
does not have checksum, they are vulnerable to corruption too.
 
- Details:
The steps to create new edits log (which we infer from HDFS code) are:
1) truncate the file to zero size
2) write FSConstants.LAYOUT_VERSION to buffer
3) insert the end-of-file marker OP_INVALID to the end of the buffer
4) preallocate 1MB of data, and fill the data with 0
5) flush the buffer to disk
 
Note that only in step 1, 4, 5, the data on disk is actually changed.
Now, suppose a crash happens after step 4, but before step 5.
In the next reboot, NameNode will fetch this edits log file (which contains
all 0). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK,
because NameNode has code to handle that case.
(but we expect LAYOUT_VERSION to be -18, don't we). 
Now it parses the operation code, which happens to be 0. Unfortunately, since 0
is the value for OP_ADD, the NameNode expects some parameters corresponding 
to that operation. Now NameNode calls readString to read the path, which throws
an exception leading to a failed reboot.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

  was:
- Summary: 
If a crash happens during FSEditLog.createEditLogFile(), the
edits log file on disk may be stale. During next reboot, NameNode 
will get an exception when parsing the edits file, because of stale data, 
leading to unsuccessful reboot.
Note: This is just one example. Since we see that edits log (and fsimage)
does not have checksum, they are vulnerable to corruption too.
 
- Details:
The steps to create new edits log (which we infer from HDFS code) are:
1) truncate the file to zero size
2) write FSConstants.LAYOUT_VERSION to buffer
3) insert the end-of-file marker OP_INVALID to the end of the buffer
4) preallocate 1MB of data, and fill the data with 0
5) flush the buffer to disk
 
Note that only in step 1, 4, 5, the data on disk is actually changed.
Now, suppose a crash happens after step 4, but before step 5.
In the next reboot, NameNode will fetch this edits log file (which contains
all 0). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK,
because NameNode has code to handle that case.
(but we expect LAYOUT_VERSION to be -18, don't we). 
Now it parses the operation code, which happens to be 0. Unfortunately, since 0
is the value for OP_ADD, the NameNode expects some parameters corresponding 
to that operation. Now NameNode calls readString to read the path, which throws
an exception leading to a failed reboot.

We found this problem almost at the same time as HDFS developers.
Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage.
Hence, any crash happens after the truncation but before the renaming will lead
to a data loss. Detailed description can be found here:
https://issues.apache.org/jira/browse/HDFS-955
This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

Component/s: name-node

 NameNode unable to start due to stale edits log after a crash
 -

 Key: HDFS-1221
 URL: https://issues.apache.org/jira/browse/HDFS-1221
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: 
 If a crash happens during FSEditLog.createEditLogFile(), the
 edits log file on disk may be stale. During next reboot, NameNode 
 will get an exception when parsing the edits file, because of stale data, 
 leading to unsuccessful reboot.
 Note: This is just one example. Since we see that edits log (and fsimage)
 does not have checksum, they are vulnerable to corruption too.
  
 - Details:
 The steps to create new edits log (which we infer from HDFS code) are:
 1) truncate the file to zero size
 2) write FSConstants.LAYOUT_VERSION to buffer
 3) insert the end-of-file marker OP_INVALID to the end of the buffer
 4) preallocate 1MB of data, and fill the data with 0
 5) flush the buffer to disk
  
 Note that only in step 1, 4, 5, the data on disk is actually changed.
 Now, suppose a crash happens after step 4, but before 

[jira] Created: (HDFS-1221) NameNode unable to start due to stale edits log after a crash

2010-06-16 Thread Thanh Do (JIRA)
NameNode unable to start due to stale edits log after a crash
-

 Key: HDFS-1221
 URL: https://issues.apache.org/jira/browse/HDFS-1221
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: 
If a crash happens during FSEditLog.createEditLogFile(), the
edits log file on disk may be stale. During the next reboot, the NameNode 
will get an exception when parsing the edits file because of the stale data, 
leading to an unsuccessful reboot.
Note: this is just one example. Since we see that the edits log (and fsimage)
do not have checksums, they are vulnerable to corruption too.
 
- Details:
The steps to create a new edits log (which we infer from the HDFS code) are:
1) truncate the file to zero size
2) write FSConstants.LAYOUT_VERSION to a buffer
3) insert the end-of-file marker OP_INVALID at the end of the buffer
4) preallocate 1MB of data, and fill the data with 0
5) flush the buffer to disk
 
Note that only in steps 1, 4 and 5 is the data on disk actually changed.
Now, suppose a crash happens after step 4, but before step 5.
On the next reboot, the NameNode will fetch this edits log file (which contains
all 0s). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK,
because the NameNode has code to handle that case
(but we expect LAYOUT_VERSION to be -18, don't we). 
Now it parses the operation code, which happens to be 0. Unfortunately, since 0
is the value of OP_ADD, the NameNode expects some parameters corresponding 
to that operation. The NameNode then calls readString to read the path, which throws
an exception, leading to a failed reboot.

We found this problem at almost the same time as the HDFS developers.
Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage.
Hence, any crash that happens after the truncation but before the renaming will lead
to data loss. A detailed description can be found here:
https://issues.apache.org/jira/browse/HDFS-955
This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)
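
As an illustration of a defensive load-time check for the all-zeros case described above, the sketch below refuses to parse opcodes from an edits file whose header reads 0 and instead treats it as never written. The class name, helper, and the -18 constant (taken from the description) are illustrative, not actual NameNode code.
{noformat}
// Sketch only: reject a preallocated-but-never-flushed edits log instead of
// parsing opcodes from a file full of zeros.
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class EditLogHeaderCheckSketch {
  static final int EXPECTED_LAYOUT_VERSION = -18; // per the description above

  /** Returns true if the log looks usable, false if it looks never-written. */
  static boolean hasValidHeader(File editsFile) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream(editsFile));
    try {
      int version = in.readInt();
      if (version == 0) {
        // All-zero header: the log was preallocated but its header never hit disk.
        return false;
      }
      if (version != EXPECTED_LAYOUT_VERSION) {
        throw new IOException("Unexpected edits log layout version: " + version);
      }
      return true;
    } catch (EOFException e) {
      return false; // truncated before the header was written
    } finally {
      in.close();
    }
  }
}
{noformat}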

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1222) NameNode fail stop in spite of multiple metadata directories

2010-06-16 Thread Thanh Do (JIRA)
NameNode fail stop in spite of multiple metadata directories


 Key: HDFS-1222
 URL: https://issues.apache.org/jira/browse/HDFS-1222
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do


Despite the ability to configure multiple name directories
(to store fsimage) and edits directories, the NameNode will fail-stop 
most of the time it faces an exception when accessing these directories.
 
The NameNode fail-stops if an exception happens when loading fsimage,
reading fstime, loading the edits log, writing fsimage.ckpt ..., although there 
are still good replicas. The NameNode could have tried to work with the other
replicas and marked the faulty one.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)
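
A minimal sketch of the behaviour the report argues for: try each configured metadata directory in turn, remember the ones that fail, and give up only when no replica is readable. The loader callback and class names are hypothetical, not NameNode code.
{noformat}
// Sketch only: load from any good metadata directory and mark the bad ones.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class RedundantStorageLoaderSketch {
  interface Loader {
    void load(File dir) throws IOException; // e.g. read fsimage/fstime/edits
  }

  private final List<File> faultyDirs = new ArrayList<>();

  void loadFromAnyGoodDir(List<File> metadataDirs, Loader loader)
      throws IOException {
    IOException last = null;
    for (File dir : metadataDirs) {
      try {
        loader.load(dir);
        return;                 // one good replica is enough
      } catch (IOException e) {
        faultyDirs.add(dir);    // remember the bad replica, keep going
        last = e;
      }
    }
    throw new IOException("All metadata directories failed", last);
  }

  List<File> getFaultyDirs() {
    return faultyDirs;
  }
}
{noformat}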

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1223) DataNode fails stop due to a bad disk (or storage directory)

2010-06-16 Thread Thanh Do (JIRA)
DataNode fails stop due to a bad disk (or storage directory)


 Key: HDFS-1223
 URL: https://issues.apache.org/jira/browse/HDFS-1223
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


A datanode can store block files in multiple volumes.
If a datanode sees a bad volume during start up (i.e., it faces an exception
when accessing that volume), it simply fail-stops, making all block files
stored in the other, healthy volumes inaccessible. Consequently, these lost
replicas will be regenerated later on other datanodes. 
If a datanode were able to mark the bad disk and continue working with the
healthy ones, this would increase availability and avoid unnecessary 
regeneration. As an extreme example, consider one datanode which has
2 volumes V1 and V2, each containing about 1 64MB block file.
During startup, the datanode gets an exception when accessing V1; it then 
fail-stops, causing 2 block files to be regenerated later on.
If the datanode instead marked V1 as bad and continued working with V2, the
number of replicas that need to be regenerated would be cut in half.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)
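
A minimal sketch of the suggested behaviour: scan the configured volumes at startup, keep the healthy ones and record the bad ones, and refuse to start only when no volume is usable. Names and the simple readability/writability check are illustrative, not the actual DataNode startup code.
{noformat}
// Sketch only: tolerate bad volumes at startup instead of fail-stopping.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class VolumeStartupSketch {
  static List<File> selectUsableVolumes(List<File> configuredVolumes)
      throws IOException {
    List<File> usable = new ArrayList<>();
    List<File> failed = new ArrayList<>();
    for (File vol : configuredVolumes) {
      if (vol.isDirectory() && vol.canRead() && vol.canWrite()) {
        usable.add(vol);       // serve blocks from this volume
      } else {
        failed.add(vol);       // mark bad, but keep going
      }
    }
    if (usable.isEmpty()) {
      throw new IOException("No usable volumes among " + configuredVolumes);
    }
    if (!failed.isEmpty()) {
      System.err.println("Skipping bad volumes: " + failed);
    }
    return usable;
  }
}
{noformat}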

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1224) Stale connection makes node miss append

2010-06-16 Thread Thanh Do (JIRA)
Stale connection makes node miss append
---

 Key: HDFS-1224
 URL: https://issues.apache.org/jira/browse/HDFS-1224
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Thanh Do


- Summary: if a datanode crashes and restarts, it may miss an append.
 
- Setup:
+ available datanodes = 3
+ replica = 3 
+ disks / datanode = 1
+ failures = 1
+ failure type = crash
+ When/where failure happens = after the first append succeed
 
- Details:
Since each datanode maintains a pool of IPC connections, whenever it wants
to make an IPC call it first looks into the pool. If the connection is not there,
it is created and put into the pool. Otherwise the existing connection is used.
Suppose that the append pipeline contains dn1, dn2, and dn3, and dn1 is the primary.
After the client appends to block X successfully, dn2 crashes and restarts.
Now the client writes a new block Y to dn1, dn2 and dn3. The write is successful.
The client starts appending to block Y. It first calls dn1.recoverBlock().
Dn1 will first create a proxy corresponding to each of the datanodes in the pipeline
(in order to make RPC calls like getMetadataInfo() or updateBlock()).
However, because dn2 has just crashed and restarted, its connection in dn1's pool
has become stale. Dn1 uses this connection to make a call to dn2, hence an exception.
Therefore, the append will be made only to dn1 and dn3, although dn2 is alive and
the write of block Y to dn2 was successful.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)
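
A minimal sketch of one way to tolerate this: when a pooled connection fails, evict it and retry once with a freshly created connection instead of giving up on the peer. The pool, factory and call types below are hypothetical stand-ins for the datanode's IPC proxy handling, not an actual patch.
{noformat}
// Sketch only: evict a stale pooled connection on failure and retry once.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class StaleConnectionRetrySketch<C> {
  interface Factory<C> {
    C connect(String peer) throws IOException;
  }
  interface Call<C, T> {
    T apply(C conn) throws IOException;
  }

  private final Map<String, C> pool = new HashMap<>();
  private final Factory<C> factory;

  StaleConnectionRetrySketch(Factory<C> factory) {
    this.factory = factory;
  }

  synchronized <T> T call(String peer, Call<C, T> rpc) throws IOException {
    C conn = pool.get(peer);
    if (conn == null) {
      conn = factory.connect(peer);
      pool.put(peer, conn);
    }
    try {
      return rpc.apply(conn);
    } catch (IOException stale) {
      // The cached connection may predate a peer restart: evict and retry once.
      pool.remove(peer);
      C fresh = factory.connect(peer);
      pool.put(peer, fresh);
      return rpc.apply(fresh);
    }
  }
}
{noformat}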

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1224) Stale connection makes node miss append

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1224:
---

Affects Version/s: 0.20.1
  Component/s: data-node

 Stale connection makes node miss append
 ---

 Key: HDFS-1224
 URL: https://issues.apache.org/jira/browse/HDFS-1224
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: if a datanode crashes and restarts, it may miss an append.
  
 - Setup:
 + # available datanodes = 3
 + # replica = 3 
 + # disks / datanode = 1
 + # failures = 1
 + failure type = crash
 + When/where failure happens = after the first append succeed
  
 - Details:
 Since each datanode maintains a pool of IPC connections, whenever it wants
 to make an IPC call, it first looks into the pool. If the connection is not 
 there, 
 it is created and put in to the pool. Otherwise the existing connection is 
 used.
 Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the 
 primary.
 After the client appends to block X successfully, dn2 crashes and restarts.
 Now client writes a new block Y to dn1, dn2 and dn3. The write is successful.
 Client starts appending to block Y. It first calls dn1.recoverBlock().
 Dn1 will first create a proxy corresponding with each of the datanode in the 
 pipeline
 (in order to make RPC call like getMetadataInfo( )  or updateBlock( )). 
 However, because
 dn2 has just crashed and restarts, its connection in dn1's pool become stale. 
 Dn1 uses
 this connection to make a call to dn2, hence an exception. Therefore, append 
 will be
 made only to dn1 and dn3, although dn2 is alive and the write of block Y to 
 dn2 has
 been successful.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1224) Stale connection makes node miss append

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1224:
---

Description: 
- Summary: if a datanode crashes and restarts, it may miss an append.
 
- Setup:
+ # available datanodes = 3
+ # replica = 3 
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = after the first append succeed
 
- Details:
Since each datanode maintains a pool of IPC connections, whenever it wants
to make an IPC call, it first looks into the pool. If the connection is not 
there, 
it is created and put in to the pool. Otherwise the existing connection is used.
Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the primary.
After the client appends to block X successfully, dn2 crashes and restarts.
Now client writes a new block Y to dn1, dn2 and dn3. The write is successful.
Client starts appending to block Y. It first calls dn1.recoverBlock().
Dn1 will first create a proxy corresponding with each of the datanode in the 
pipeline
(in order to make RPC call like getMetadataInfo( )  or updateBlock( )). 
However, because
dn2 has just crashed and restarts, its connection in dn1's pool become stale. 
Dn1 uses
this connection to make a call to dn2, hence an exception. Therefore, append 
will be
made only to dn1 and dn3, although dn2 is alive and the write of block Y to dn2 
has
been successful.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

  was:
- Summary: if a datanode crashes and restarts, it may miss an append.
 
- Setup:
+ available datanodes = 3
+ replica = 3 
+ disks / datanode = 1
+ failures = 1
+ failure type = crash
+ When/where failure happens = after the first append succeed
 
- Details:
Since each datanode maintains a pool of IPC connections, whenever it wants
to make an IPC call, it first looks into the pool. If the connection is not 
there, 
it is created and put in to the pool. Otherwise the existing connection is used.
Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the primary.
After the client appends to block X successfully, dn2 crashes and restarts.
Now client writes a new block Y to dn1, dn2 and dn3. The write is successful.
Client starts appending to block Y. It first calls dn1.recoverBlock().
Dn1 will first create a proxy corresponding with each of the datanode in the 
pipeline
(in order to make RPC call like getMetadataInfo( )  or updateBlock( )). 
However, because
dn2 has just crashed and restarts, its connection in dn1's pool become stale. 
Dn1 uses
this connection to make a call to dn2, hence an exception. Therefore, append 
will be
made only to dn1 and dn3, although dn2 is alive and the write of block Y to dn2 
has
been successful.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)


 Stale connection makes node miss append
 ---

 Key: HDFS-1224
 URL: https://issues.apache.org/jira/browse/HDFS-1224
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: if a datanode crashes and restarts, it may miss an append.
  
 - Setup:
 + # available datanodes = 3
 + # replica = 3 
 + # disks / datanode = 1
 + # failures = 1
 + failure type = crash
 + When/where failure happens = after the first append succeed
  
 - Details:
 Since each datanode maintains a pool of IPC connections, whenever it wants
 to make an IPC call, it first looks into the pool. If the connection is not 
 there, 
 it is created and put in to the pool. Otherwise the existing connection is 
 used.
 Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the 
 primary.
 After the client appends to block X successfully, dn2 crashes and restarts.
 Now client writes a new block Y to dn1, dn2 and dn3. The write is successful.
 Client starts appending to block Y. It first calls dn1.recoverBlock().
 Dn1 will first create a proxy corresponding with each of the datanode in the 
 pipeline
 (in order to make RPC call like getMetadataInfo( )  or updateBlock( )). 
 However, because
 dn2 has just crashed and restarts, its connection in dn1's pool become stale. 
 Dn1 uses
 this connection to make a call to dn2, hence an exception. Therefore, append 
 will be
 made only to dn1 and dn3, although dn2 is alive and the write of block Y to 
 dn2 has
 been successful.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, 

[jira] Created: (HDFS-1225) Block lost when primary crashes in recoverBlock

2010-06-16 Thread Thanh Do (JIRA)
Block lost when primary crashes in recoverBlock
---

 Key: HDFS-1225
 URL: https://issues.apache.org/jira/browse/HDFS-1225
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: the block is lost if the primary datanode crashes in the middle of
tryUpdateBlock.
 
- Setup:
# available datanode = 2
# replica = 2
# disks / datanode = 1
# failures = 1
# failure type = crash
When/where failure happens = (see below)
 
- Details:
Suppose we have 2 datanodes, dn1 and dn2, and dn1 is the primary.
The client appends to blk_X_1001 and a crash happens during dn1.recoverBlock,
at the point after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.
Interestingly, in this case block X is lost eventually. Why?
After dn1.recoverBlock crashes at the rename, what is left in dn1's current directory is:
1) blk_X
2) blk_X_1001.meta_tmp1002
== this is an invalid block, because it has no meta file associated with it.
dn2 (after the dn1 crash) now contains:
1) blk_X
2) blk_X_1002.meta
(note that the rename at dn2 is completed, because dn1 called dn2.updateBlock()
before calling its own updateBlock())
But namenode.commitBlockSynchronization is not reported to the namenode,
because dn1 has crashed. Therefore, from the namenode's point of view, block X
has GS 1001. Hence, the block is lost.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender

2010-06-16 Thread Thanh Do (JIRA)
Last block is temporary unavailable for readers because of crashed appender
---

 Key: HDFS-1226
 URL: https://issues.apache.org/jira/browse/HDFS-1226
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: the last block is unavailable to subsequent readers if the appender
crashes in the middle of an appending workload.
 
- Setup:
# available datanodes = 3
# disks / datanode = 1
# failures = 1
failure type = crash
When/where failure happens = (see below)
 
- Details:
Say a client is appending to block X at 3 datanodes: dn1, dn2 and dn3. After a
successful recoverBlock at the primary datanode, the client calls createOutputStream,
which makes all datanodes move the block file and the meta file from the current
directory to the tmp directory. Now suppose the client crashes. Since all replicas
of block X are in the tmp folders of the corresponding datanodes,
subsequent readers cannot read block X.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1226:
---

Description: 
- Summary: the last block is unavailable to subsequent readers if appender 
crashes in the
middle of appending workload.
 
- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = (see below)
 
- Details:
Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After 
successful 
recoverBlock at primary datanode, client calls createOutputStream, which make 
all datanodes
move the block file and the meta file from current directory to tmp directory. 
Now suppose
the client crashes. Since all replicas of block X are in tmp folders of 
corresponding datanode,
subsequent readers cannot read block X.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)


  was:
- Summary: the last block is unavailable to subsequent readers if appender 
crashes in the
middle of appending workload.
 
- Setup:
# available datanodes = 3
# disks / datanode = 1
# failures = 1
failure type = crash
When/where failure happens = (see below)
 
- Details:
Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After 
successful 
recoverBlock at primary datanode, client calls createOutputStream, which make 
all datanodes
move the block file and the meta file from current directory to tmp directory. 
Now suppose
the client crashes. Since all replicas of block X are in tmp folders of 
corresponding datanode,
subsequent readers cannot read block X.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)



 Last block is temporary unavailable for readers because of crashed appender
 ---

 Key: HDFS-1226
 URL: https://issues.apache.org/jira/browse/HDFS-1226
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: the last block is unavailable to subsequent readers if appender 
 crashes in the
 middle of appending workload.
  
 - Setup:
 + # available datanodes = 3
 + # disks / datanode = 1
 + # failures = 1
 + failure type = crash
 + When/where failure happens = (see below)
  
 - Details:
 Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After 
 successful 
 recoverBlock at primary datanode, client calls createOutputStream, which make 
 all datanodes
 move the block file and the meta file from current directory to tmp 
 directory. Now suppose
 the client crashes. Since all replicas of block X are in tmp folders of 
 corresponding datanode,
 subsequent readers cannot read block X.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1226:
---

Description: 
- Summary: the last block is unavailable to subsequent readers if appender 
crashes in the
middle of appending workload.
 
- Setup:
# available datanodes = 3
# disks / datanode = 1
# failures = 1
failure type = crash
When/where failure happens = (see below)
 
- Details:
Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After 
successful 
recoverBlock at primary datanode, client calls createOutputStream, which make 
all datanodes
move the block file and the meta file from current directory to tmp directory. 
Now suppose
the client crashes. Since all replicas of block X are in tmp folders of 
corresponding datanode,
subsequent readers cannot read block X.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)


  was:
- Summary: the last block is unavailable to subsequent readers if appender 
crashes in the
middle of appending workload.
 
- Setup:
# available datanodes = 3
# disks / datanode = 1
# failures = 1
failure type = crash
When/where failure happens = (see below)
 
- Details:
Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After 
successful 
recoverBlock at primary datanode, client calls createOutputStream, which make 
all datanodes
move the block file and the meta file from current directory to tmp directory. 
Now suppose
the client crashes. Since all replicas of block X are in tmp folders of 
corresponding datanode,
subsequent readers cannot read block X.



 Last block is temporary unavailable for readers because of crashed appender
 ---

 Key: HDFS-1226
 URL: https://issues.apache.org/jira/browse/HDFS-1226
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: the last block is unavailable to subsequent readers if appender 
 crashes in the
 middle of appending workload.
  
 - Setup:
 # available datanodes = 3
 # disks / datanode = 1
 # failures = 1
 failure type = crash
 When/where failure happens = (see below)
  
 - Details:
 Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After 
 successful 
 recoverBlock at primary datanode, client calls createOutputStream, which make 
 all datanodes
 move the block file and the meta file from current directory to tmp 
 directory. Now suppose
 the client crashes. Since all replicas of block X are in tmp folders of 
 corresponding datanode,
 subsequent readers cannot read block X.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1227) UpdateBlock fails due to unmatched file length

2010-06-16 Thread Thanh Do (JIRA)
UpdateBlock fails due to unmatched file length
--

 Key: HDFS-1227
 URL: https://issues.apache.org/jira/browse/HDFS-1227
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: client append is not atomic; hence, it is possible that
when retrying during append, there is an exception in updateBlock
indicating an unmatched file length, causing the append to fail.
 
- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 2
+ failure type = bad disk
+ When/where failure happens = (see below)
+ This bug is non-deterministic; to reproduce it, add a sufficient sleep before 
out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3
 
- Details:
Suppose a client appends 16 bytes to block X, which has length 16 bytes, at dn1, 
dn2, dn3.
Dn1 is the primary. The pipeline is dn3->dn2->dn1. recoverBlock succeeds.
The client starts sending data to dn3, the first datanode in the pipeline.
dn3 forwards the packet to the downstream datanodes, and starts writing
data to its disk. Suppose there is an exception in dn3 when writing to disk.
The client gets the exception, and it starts the recovery code by calling 
dn1.recoverBlock() again.
dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build the 
syncList.
Suppose at the time getMetadataInfo() is called at both datanodes (dn1 and dn2),
the previous packet (which was sent from dn3) has not reached the disk yet.
Hence, the block info given by getMetaDataInfo contains a length of 16 bytes.
But after that, the packet reaches the disk, so the block file length now 
becomes 32 bytes.
Using the syncList (which contains block info with a length of 16 bytes), dn1 calls 
updateBlock at
dn2 and dn1, which will fail, because the length in the block info (given to 
updateBlock,
which is 16 bytes) does not match its actual length on disk (which is 32 
bytes).
 
Note that this bug is non-deterministic. It depends on the thread interleaving
at the datanodes.
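
To make the failing check concrete, here is a simplified, hypothetical sketch (the names and paths are made up; this is not the actual DataNode/FSDataset code) of a length comparison of the kind that rejects the update when the syncList still says 16 bytes but the block file on disk has already grown to 32 bytes:

import java.io.File;
import java.io.IOException;

class UpdateBlockSketch {
    static void updateBlock(File blockFile, long expectedLength, long newGenerationStamp)
            throws IOException {
        long onDiskLength = blockFile.length();
        if (onDiskLength != expectedLength) {
            // The failure path described above: the packet that was still in flight
            // when getMetadataInfo() was called has since hit the disk.
            throw new IOException("Block length mismatch: expected " + expectedLength
                    + " but found " + onDiskLength + " on disk");
        }
        // ... otherwise rename the meta file to carry newGenerationStamp ...
    }

    public static void main(String[] args) {
        try {
            // Hypothetical path; the expected length comes from the stale syncList.
            updateBlock(new File("/tmp/blk_42"), 16L, 2L);
        } catch (IOException e) {
            System.out.println("updateBlock failed: " + e.getMessage());
        }
    }
}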

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1219) Data Loss due to edits log truncation

2010-06-16 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879655#action_12879655
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1219:
--

Then, is this the same as HDFS-955?

 Data Loss due to edits log truncation
 -

 Key: HDFS-1219
 URL: https://issues.apache.org/jira/browse/HDFS-1219
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.2
Reporter: Thanh Do

 We found this problem almost at the same time as HDFS developers.
 Basically, the edits log is truncated before fsimage.ckpt is renamed to 
 fsimage.
 Hence, any crash happens after the truncation but before the renaming will 
 lead
 to a data loss. Detailed description can be found here:
 https://issues.apache.org/jira/browse/HDFS-955
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1228) CRC does not match when retrying appending a partial block

2010-06-16 Thread Thanh Do (JIRA)
CRC does not match when retrying appending a partial block
--

 Key: HDFS-1228
 URL: https://issues.apache.org/jira/browse/HDFS-1228
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: when appending to a partial block, it is possible that
a retrial after facing an exception fails due to a checksum mismatch.
The append operation is not atomic (it neither completes fully nor fails cleanly).
 
- Setup:
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = bad disk
+ When/where failure happens = (see below)
 
- Details:
Client writes 16 bytes to dn1 and dn2. The write completes. So far so good.
The meta file now contains: 7 bytes of header + a 4-byte checksum (CK1, the
checksum for the 16 bytes). The client then appends 16 more bytes, and let us assume there is an
exception at BlockReceiver.receivePacket() at dn2. So the client knows dn2
is bad. BUT the append at dn1 is complete (i.e. the data portion and checksum 
portion
have been written to disk in the corresponding block file and meta file), meaning 
that the
checksum file at dn1 now contains 7 bytes of header + a 4-byte checksum (CK2, the 
checksum for 32 bytes of data). Because dn2 hit an exception, the client calls 
recoverBlock and
starts the append again to dn1. dn1 receives the 16 bytes of data and verifies whether the 
pre-computed
CRC (CK2) matches what is recalculated now (CK1), which obviously does not 
match.
Hence an exception, and the retrial fails.
 
- A similar bug has been reported at
https://issues.apache.org/jira/browse/HDFS-679,
but here it manifests in a different context.
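
The mismatch itself is easy to reproduce outside Hadoop; the sketch below uses plain java.util.zip.CRC32 (standing in for Hadoop's checksum classes, which it is not) to show that a checksum computed over the first 16 bytes cannot be expected to match one computed over all 32 bytes:

import java.util.zip.CRC32;

class ChecksumMismatchSketch {
    static long crc(byte[] data, int len) {
        CRC32 c = new CRC32();
        c.update(data, 0, len);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] block = new byte[32];     // 16 original bytes plus 16 appended bytes
        long ck1 = crc(block, 16);       // what the client recomputes for the retried 16-byte append
        long ck2 = crc(block, 32);       // what dn1 already persisted in the meta file
        System.out.println("CK1=" + ck1 + " CK2=" + ck2 + " match=" + (ck1 == ck2));
    }
}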

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock

2010-06-16 Thread Thanh Do (JIRA)
DFSClient incorrectly asks for new block if primary crashes during first 
recoverBlock
-

 Key: HDFS-1229
 URL: https://issues.apache.org/jira/browse/HDFS-1229
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Setup:
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = during primary's recoverBlock
 
- Details:
Say client is appending to block X1 in 2 datanodes: dn1 and dn2.
First it needs to make sure both dn1 and dn2  agree on the new GS of the block.
1) Client first creates DFSOutputStream by calling
 
OutputStream result = new DFSOutputStream(src, buffersize, progress,
lastBlock, stat, 
 conf.getInt("io.bytes.per.checksum", 512));
 
in DFSClient.append()
 
2) The above DFSOutputStream constructor in turn calls 
processDatanodeError(true, true)
(i.e., hasError = true, isAppend = true), and starts the DataStreamer
 
 processDatanodeError(true, true);  /* let's call this PDNE 1 */
 streamer.start();
 
Note that DataStreamer.run() also calls processDatanodeError()
 while (!closed && clientRunning) {
  ...
  boolean doSleep = processDatanodeError(hasError, false); /* let's call 
 this PDNE 2 */
 
3) Now in PDNE 1, we have the following code:
 
 blockStream = null;
 blockReplyStream = null;
 ...
 while (!success && clientRunning) {
 ...
try {
 primary = createClientDatanodeProtocolProxy(primaryNode, conf);
 newBlock = primary.recoverBlock(block, isAppend, newnodes); 
 /* exception here */
 ...
catch (IOException e) {
 ...
 if (recoveryErrorCount > maxRecoveryErrorCount) { 
 /* this condition is false */
 }
 ...
 return true;
} // end catch
finally {...}

this.hasError = false;
lastException = null;
errorIndex = 0;
success = createBlockOutputStream(nodes, clientName, true);
}
...
 
Because dn1 crashes during the client's call to recoverBlock, we get an exception.
Hence, we go to the catch block, in which processDatanodeError returns true
before setting hasError to false. Also, because createBlockOutputStream() is 
not called
(due to the early return), blockStream is still null.
 
4) Now PDNE 1 has finished, we come to streamer.start(), which calls PDNE 2.
Because hasError = false, PDNE 2 returns false immediately without doing 
anything
if (!hasError) {
 return false;
}
 
5) still in the DataStreamer.run(), after returning false from PDNE 2, we still 
have
blockStream = null, hence the following code is executed:
  if (blockStream == null) {
   nodes = nextBlockOutputStream(src);
   this.setName("DataStreamer for file " + src +
  " block " + block);
response = new ResponseProcessor(nodes);
response.start();
  }
 
nextBlockOutputStream(), which asks the namenode to allocate a new block, is called.
(This is not good, because we are appending, not writing.)
The namenode gives it a new block ID and a set of datanodes, including the crashed dn1.
This leads createOutputStream() to fail because it tries to contact dn1 
first
(which has crashed). The client retries 5 times without any success,
because every time it asks the namenode for a new block! Again we see
that the retry logic at the client is weird!
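
The steps above can be condensed into a small, hypothetical sketch (simplified names, not the real DFSClient) that reproduces just the control flow: PDNE 1 returns early from its catch block without creating blockStream, PDNE 2 is a no-op because the hasError field was never set, and the streamer then falls into the new-block path even though this is an append:

class AppendRetryFlowSketch {
    boolean hasError = false;      // the field is never set; only the literal 'true' was passed to PDNE 1
    Object blockStream = null;     // stays null because createBlockOutputStream() was skipped

    boolean processDatanodeError(boolean hasErrorArg, boolean isAppend) {
        if (!hasErrorArg) {
            return false;          // PDNE 2 takes this branch and does nothing
        }
        try {
            throw new java.io.IOException("recoverBlock: primary datanode crashed");
        } catch (java.io.IOException e) {
            return true;           // PDNE 1 takes this branch: early return, blockStream still null
        }
    }

    void run() {
        processDatanodeError(true, true);        // PDNE 1 (from the DFSOutputStream constructor)
        processDatanodeError(hasError, false);   // PDNE 2 (from DataStreamer.run): no-op
        if (blockStream == null) {
            // Wrong for append: this path allocates a brand-new block from the
            // namenode instead of reopening the existing last block.
            System.out.println("nextBlockOutputStream(): asking namenode for a NEW block");
        }
    }

    public static void main(String[] args) {
        new AppendRetryFlowSketch().run();
    }
}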

*This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1229:
---

Description: 
Setup:

+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = during primary's recoverBlock
 
Details:
--
Say client is appending to block X1 in 2 datanodes: dn1 and dn2.
First it needs to make sure both dn1 and dn2  agree on the new GS of the block.
1) Client first creates DFSOutputStream by calling
 
OutputStream result = new DFSOutputStream(src, buffersize, progress,
lastBlock, stat, 
 conf.getInt(io.bytes.per.checksum, 512));
 
in DFSClient.append()
 
2) The above DFSOutputStream constructor in turn calls 
processDataNodeError(true, true)
(i.e, hasError = true, isAppend = true), and starts the DataStreammer
 
 processDatanodeError(true, true);  /* let's call this PDNE 1 */
 streamer.start();
 
Note that DataStreammer.run() also calls processDatanodeError()
 while (!closed  clientRunning) {
  ...
  boolean doSleep = processDatanodeError(hasError, false); /let's call 
 this PDNE 2*/
 
3) Now in the PDNE 1, we have following code:
 
 blockStream = null;
 blockReplyStream = null;
 ...
 while (!success  clientRunning) {
 ...
try {
 primary = createClientDatanodeProtocolProxy(primaryNode, conf);
 newBlock = primary.recoverBlock(block, isAppend, newnodes); 
 /*exception here*/
 ...
catch (IOException e) {
 ...
 if (recoveryErrorCount  maxRecoveryErrorCount) { 
 /* this condition is false */
 }
 ...
 return true;
} // end catch
finally {...}

this.hasError = false;
lastException = null;
errorIndex = 0;
success = createBlockOutputStream(nodes, clientName, true);
}
...
 
Because dn1 crashes during client call to recoverBlock, we have an exception.
Hence, go to the catch block, in which processDatanodeError returns true
before setting hasError to false. Also, because createBlockOutputStream() is 
not called
(due to an early return), blockStream is still null.
 
4) Now PDNE 1 has finished, we come to streamer.start(), which calls PDNE 2.
Because hasError = false, PDNE 2 returns false immediately without doing 
anything
if (!hasError) {
 return false;
}
 
5) still in the DataStreamer.run(), after returning false from PDNE 2, we still 
have
blockStream = null, hence the following code is executed:
  if (blockStream == null) {
   nodes = nextBlockOutputStream(src);
   this.setName(DataStreamer for file  + src +
  block  + block);
response = new ResponseProcessor(nodes);
response.start();
  }
 
nextBlockOutputStream which asks namenode to allocate new Block is called.
(This is not good, because we are appending, not writing).
Namenode gives it new Block ID and a set of datanodes, including crashed dn1.
this leads to createOutputStream() fails because it tries to contact the dn1 
first.
(which has crashed). The client retries 5 times without any success,
because every time, it asks namenode for new block! Again we see
that the retry logic at client is weird!

*This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)*

  was:
- Setup:
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = during primary's recoverBlock
 
- Details:
Say client is appending to block X1 in 2 datanodes: dn1 and dn2.
First it needs to make sure both dn1 and dn2  agree on the new GS of the block.
1) Client first creates DFSOutputStream by calling
 
OutputStream result = new DFSOutputStream(src, buffersize, progress,
lastBlock, stat, 
 conf.getInt(io.bytes.per.checksum, 512));
 
in DFSClient.append()
 
2) The above DFSOutputStream constructor in turn calls 
processDataNodeError(true, true)
(i.e, hasError = true, isAppend = true), and starts the DataStreammer
 
 processDatanodeError(true, true);  /* let's call this PDNE 1 */
 streamer.start();
 
Note that DataStreammer.run() also calls processDatanodeError()
 while (!closed  clientRunning) {
  ...
  boolean doSleep = processDatanodeError(hasError, false); /let's call 
 this PDNE 2*/
 
3) Now in the PDNE 1, we have following code:
 
 blockStream = null;
 blockReplyStream = null;
 ...
 while (!success  clientRunning) {
 ...
try {
 primary = createClientDatanodeProtocolProxy(primaryNode, conf);
 newBlock = primary.recoverBlock(block, isAppend, newnodes); 
 /*exception here*/
 ...
catch (IOException e) {
 

[jira] Updated: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1229:
---

Description: 
Setup:

+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = during primary's recoverBlock
 
Details:
--
Say client is appending to block X1 in 2 datanodes: dn1 and dn2.
First it needs to make sure both dn1 and dn2  agree on the new GS of the block.
1) Client first creates DFSOutputStream by calling
 
OutputStream result = new DFSOutputStream(src, buffersize, progress,
lastBlock, stat, 
 conf.getInt("io.bytes.per.checksum", 512));
 
in DFSClient.append()
 
2) The above DFSOutputStream constructor in turn calls 
processDatanodeError(true, true)
(i.e., hasError = true, isAppend = true), and starts the DataStreamer
 
 processDatanodeError(true, true);  /* let's call this PDNE 1 */
 streamer.start();
 
Note that DataStreamer.run() also calls processDatanodeError()
 while (!closed && clientRunning) {
  ...
  boolean doSleep = processDatanodeError(hasError, false); /* let's call 
 this PDNE 2 */
 
3) Now in PDNE 1, we have the following code:
 
 blockStream = null;
 blockReplyStream = null;
 ...
 while (!success && clientRunning) {
 ...
try {
 primary = createClientDatanodeProtocolProxy(primaryNode, conf);
 newBlock = primary.recoverBlock(block, isAppend, newnodes); 
 /* exception here */
 ...
catch (IOException e) {
 ...
 if (recoveryErrorCount > maxRecoveryErrorCount) { 
 // this condition is false
 }
 ...
 return true;
} // end catch
finally {...}

this.hasError = false;
lastException = null;
errorIndex = 0;
success = createBlockOutputStream(nodes, clientName, true);
}
...
 
Because dn1 crashes during the client's call to recoverBlock, we get an exception.
Hence, we go to the catch block, in which processDatanodeError returns true
before setting hasError to false. Also, because createBlockOutputStream() is 
not called
(due to the early return), blockStream is still null.
 
4) Now PDNE 1 has finished, we come to streamer.start(), which calls PDNE 2.
Because hasError = false, PDNE 2 returns false immediately without doing 
anything

 if (!hasError) { return false; }
 
5) still in the DataStreamer.run(), after returning false from PDNE 2, we still 
have
blockStream = null, hence the following code is executed:

if (blockStream == null) {
   nodes = nextBlockOutputStream(src);
   this.setName("DataStreamer for file " + src + " block " + block);
   response = new ResponseProcessor(nodes);
   response.start();
}
 
nextBlockOutputStream(), which asks the namenode to allocate a new block, is called.
(This is not good, because we are appending, not writing.)
The namenode gives it a new block ID and a set of datanodes, including the crashed dn1.
This leads createOutputStream() to fail because it tries to contact dn1 
first
(which has crashed). The client retries 5 times without any success,
because every time it asks the namenode for a new block! Again we see
that the retry logic at the client is weird!

*This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)*

  was:
Setup:

+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = during primary's recoverBlock
 
Details:
--
Say client is appending to block X1 in 2 datanodes: dn1 and dn2.
First it needs to make sure both dn1 and dn2  agree on the new GS of the block.
1) Client first creates DFSOutputStream by calling
 
OutputStream result = new DFSOutputStream(src, buffersize, progress,
lastBlock, stat, 
 conf.getInt(io.bytes.per.checksum, 512));
 
in DFSClient.append()
 
2) The above DFSOutputStream constructor in turn calls 
processDataNodeError(true, true)
(i.e, hasError = true, isAppend = true), and starts the DataStreammer
 
 processDatanodeError(true, true);  /* let's call this PDNE 1 */
 streamer.start();
 
Note that DataStreammer.run() also calls processDatanodeError()
 while (!closed  clientRunning) {
  ...
  boolean doSleep = processDatanodeError(hasError, false); /let's call 
 this PDNE 2*/
 
3) Now in the PDNE 1, we have following code:
 
 blockStream = null;
 blockReplyStream = null;
 ...
 while (!success  clientRunning) {
 ...
try {
 primary = createClientDatanodeProtocolProxy(primaryNode, conf);
 newBlock = primary.recoverBlock(block, isAppend, newnodes); 
 /*exception here*/
 ...
catch (IOException e) {
 ...
 if (recoveryErrorCount  maxRecoveryErrorCount) { 
 /* this condition is false */
   

[jira] Created: (HDFS-1230) BlocksMap.blockinfo is not getting cleared immediately after deleting a block.This will be cleared only after block report comes from the datanode.Why we need to maintain t

2010-06-16 Thread Gokul (JIRA)
BlocksMap.blockinfo is not getting cleared immediately after deleting a 
block.This will be cleared only after block report comes from the datanode.Why 
we need to maintain the blockinfo till that time.


 Key: HDFS-1230
 URL: https://issues.apache.org/jira/browse/HDFS-1230
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 0.20.1
Reporter: Gokul


BlocksMap.blockinfo is not getting cleared immediately after deleting a 
block. It will be cleared only after a block report comes from the datanode. Why 
do we need to maintain the blockinfo until then? It increases namenode 
memory unnecessarily. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1227) UpdateBlock fails due to unmatched file length

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879662#action_12879662
 ] 

Todd Lipcon commented on HDFS-1227:
---

Believe this is addressed by HDFS-1186 in the 20-append branch

 UpdateBlock fails due to unmatched file length
 --

 Key: HDFS-1227
 URL: https://issues.apache.org/jira/browse/HDFS-1227
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: client append is not atomic, hence, it is possible that
 when retrying during append, there is an exception in updateBlock
 indicating unmatched file length, making append failed.
  
 - Setup:
 + # available datanodes = 3
 + # disks / datanode = 1
 + # failures = 2
 + failure type = bad disk
 + When/where failure happens = (see below)
 + This bug is non-deterministic, to reproduce it, add a sufficient sleep 
 before out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3
  
 - Details:
  Suppose client appends 16 bytes to block X which has length 16 bytes at dn1, 
 dn2, dn3.
 Dn1 is primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds.
 Client starts sending data to the dn3 - the first datanode in pipeline.
 dn3 forwards the packet to downstream datanodes, and starts writing
 data to its disk. Suppose there is an exception in dn3 when writing to disk.
 Client gets the exception, it starts the recovery code by calling 
 dn1.recoverBlock() again.
 dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build 
 the syncList.
 Suppose at the time getMetadataInfo() is called at both datanodes (dn1 and 
 dn2),
 the previous packet (which is sent from dn3) has not come to disk yet.
 Hence, the block Info given by getMetaDataInfo contains the length of 16 
 bytes.
 But after that, the packet comes to disk, making the block file length now 
 becomes 32 bytes.
 Using the syncList (with contains block info with length 16 byte), dn1 calls 
 updateBlock at
 dn2 and dn1, which will failed, because the length of new block info (given 
 by updateBlock,
 which is 16 byte) does not match with its actual length on disk (which is 32 
 byte)
  
 Note that this bug is non-deterministic. Its depends on the thread 
 interleaving
 at datanodes.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1226) Last block is temporary unavailable for readers because of crashed appender

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879663#action_12879663
 ] 

Todd Lipcon commented on HDFS-1226:
---

This is addressed by a combination of HDFS-142, HDFS-200, and HDFS-1057 in the 20-append branch

 Last block is temporary unavailable for readers because of crashed appender
 ---

 Key: HDFS-1226
 URL: https://issues.apache.org/jira/browse/HDFS-1226
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: the last block is unavailable to subsequent readers if appender 
 crashes in the
 middle of appending workload.
  
 - Setup:
 + # available datanodes = 3
 + # disks / datanode = 1
 + # failures = 1
 + failure type = crash
 + When/where failure happens = (see below)
  
 - Details:
 Say a client appending to block X at 3 datanodes: dn1, dn2 and dn3. After 
 successful 
 recoverBlock at primary datanode, client calls createOutputStream, which make 
 all datanodes
 move the block file and the meta file from current directory to tmp 
 directory. Now suppose
 the client crashes. Since all replicas of block X are in tmp folders of 
 corresponding datanode,
 subsequent readers cannot read block X.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1223) DataNode fails stop due to a bad disk (or storage directory)

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879665#action_12879665
 ] 

Todd Lipcon commented on HDFS-1223:
---

Already fixed by HDFS-457

 DataNode fails stop due to a bad disk (or storage directory)
 

 Key: HDFS-1223
 URL: https://issues.apache.org/jira/browse/HDFS-1223
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 A datanode can store block files in multiple volumes.
 If a datanode sees a bad volume during start up (i.e, face an exception
 when accessing that volume), it simply fail stops, making all block files
 stored in other healthy volumes inaccessible. Consequently, these lost
 replicas will be generated later on in other datanodes. 
 If a datanode is able to mark the bad disk and continue working with
 healthy ones, this will increase availability and avoid unnecessary 
 regeneration. As an extreme example, consider one datanode which has
 2 volumes V1 and V2, each contains about 1 64MB block files.
 During startup, the datanode gets an exception when accessing V1, it then 
 fail stops, making 2 block files generated later on.
 If the datanode masks V1 as bad and continues working with V2, the number
 of replicas needed to be regenerated is cut in to half.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1224) Stale connection makes node miss append

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879664#action_12879664
 ] 

Todd Lipcon commented on HDFS-1224:
---

If the node has crashed, the TCP connection should be broken and thus it won't 
re-use an existing connection, no?
Even so, does this cause any actual problems aside from a shorter pipeline?
Given that we only cache IPC connections for a short amount of time, the likelihood 
of a DN restart while a connection is cached is very small.

 Stale connection makes node miss append
 ---

 Key: HDFS-1224
 URL: https://issues.apache.org/jira/browse/HDFS-1224
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: if a datanode crashes and restarts, it may miss an append.
  
 - Setup:
 + # available datanodes = 3
 + # replica = 3 
 + # disks / datanode = 1
 + # failures = 1
 + failure type = crash
 + When/where failure happens = after the first append succeed
  
 - Details:
 Since each datanode maintains a pool of IPC connections, whenever it wants
 to make an IPC call, it first looks into the pool. If the connection is not 
 there, 
 it is created and put in to the pool. Otherwise the existing connection is 
 used.
 Suppose that the append pipeline contains dn1, dn2, and dn3. Dn1 is the 
 primary.
 After the client appends to block X successfully, dn2 crashes and restarts.
 Now client writes a new block Y to dn1, dn2 and dn3. The write is successful.
 Client starts appending to block Y. It first calls dn1.recoverBlock().
 Dn1 will first create a proxy corresponding with each of the datanode in the 
 pipeline
 (in order to make RPC call like getMetadataInfo( )  or updateBlock( )). 
 However, because
 dn2 has just crashed and restarts, its connection in dn1's pool become stale. 
 Dn1 uses
 this connection to make a call to dn2, hence an exception. Therefore, append 
 will be
 made only to dn1 and dn3, although dn2 is alive and the write of block Y to 
 dn2 has
 been successful.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime

2010-06-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879666#action_12879666
 ] 

Todd Lipcon commented on HDFS-1220:
---

I believe we fixed this in trunk by saving to an fsimage_ckpt dir and then 
moving it into place atomically once all the files are on disk. See HDFS-955?
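
For readers of this thread, the pattern Todd describes amounts to write-to-a-temporary-file-then-rename; a minimal sketch using standard java.nio (not the actual FSImage code, and with a made-up temporary-file suffix) looks like this:

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicTimeFileSketch {
    static void writeCheckpointTime(Path timeFile, long checkpointTime) throws IOException {
        Path tmp = timeFile.resolveSibling(timeFile.getFileName() + ".ckpt"); // hypothetical name
        FileOutputStream fos = new FileOutputStream(tmp.toFile());
        try (DataOutputStream out = new DataOutputStream(fos)) {
            out.writeLong(checkpointTime);
            out.flush();
            fos.getFD().sync();                  // the new value must be fully on disk first
        }
        // Atomic replacement: a crash leaves either the old or the new file,
        // never a truncated one that the NameNode cannot read 8 bytes from.
        Files.move(tmp, timeFile, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path fstime = Files.createTempDirectory("nn").resolve("fstime"); // stand-in for the storage dir
        writeCheckpointTime(fstime, System.currentTimeMillis());
        System.out.println("wrote " + fstime);
    }
}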

 Namenode unable to start due to truncated fstime
 

 Key: HDFS-1220
 URL: https://issues.apache.org/jira/browse/HDFS-1220
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: updating fstime file on disk is not atomic, so it is possible that
 if a crash happens in the middle, next time when NameNode reboots, it will
 read stale fstime, hence unable to start successfully.
  
 - Details:
 Basically, this involve 3 steps:
 1) delete fstime file (timeFile.delete())
 2) truncate fstime file (new FileOutputStream(timeFile))
 3) write new time to fstime file (out.writeLong(checkpointTime))
 If a crash happens after step 2 and before step 3, in the next reboot, 
 NameNode
 got an exception when reading the time (8 byte) from an empty fstime file.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1219) Data Loss due to edits log truncation

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1219.
---

Resolution: Duplicate

Why file this bug if it's the same as 955?

 Data Loss due to edits log truncation
 -

 Key: HDFS-1219
 URL: https://issues.apache.org/jira/browse/HDFS-1219
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.2
Reporter: Thanh Do

 We found this problem almost at the same time as HDFS developers.
 Basically, the edits log is truncated before fsimage.ckpt is renamed to 
 fsimage.
 Hence, any crash happens after the truncation but before the renaming will 
 lead
 to a data loss. Detailed description can be found here:
 https://issues.apache.org/jira/browse/HDFS-955
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1231) Generation Stamp mismatches, leading to failed append

2010-06-16 Thread Thanh Do (JIRA)
Generation Stamp mismatches, leading to failed append
-

 Key: HDFS-1231
 URL: https://issues.apache.org/jira/browse/HDFS-1231
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: recoverBlock is not atomic, so a retrial fails when it 
faces a failure.
 
- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 2
+ failure type = crash
+ When/where failure happens = (see below)
 
- Details:
Suppose there are 3 datanodes in the pipeline: dn3, dn2, and dn1. Dn1 is 
primary.
When appending, the client first calls dn1.recoverBlock to make all the datanodes 
in the
pipeline agree on the new Generation Stamp (GS1) and the length of the block.
The client then sends a data packet to dn3. dn3 in turn forwards this packet to the 
downstream
datanodes (dn2 and dn1) and starts writing to its own disk, then it crashes AFTER 
writing to the block
file but BEFORE writing to the meta file. The client notices the crash and calls 
dn1.recoverBlock().
dn1.recoverBlock() first creates a syncList (by calling getMetadataInfo at all 
dn2 and dn1).
Then dn1 calls NameNode.getNextGS() to get new Generation Stamp (GS2).
Then it calls dn2.updateBlock(), this returns successfully.
Now, it starts calling its own updateBlock and crashes after renaming from
blk_X_GS1.meta to blk_X_GS1.meta_tmpGS2.
Therefore, dn1.recoverBlock() fails from the client's point of view,
but the GS for the corresponding block has been incremented at the namenode (GS2).
The client retries by calling dn2.recoverBlock with the old GS (GS1), which does 
not match
the new GS --> exception, and the append fails.
 
Now, after all, we have
- in dn3 (which is crashed)
tmp/blk_X
tmp/blk_X_GS1.meta
- in dn2
current/blk_X
current/blk_X_GS2.meta
- in dn1:
current/blk_X
current/blk_X_GS1.meta_tmpGS2
- in the NameNode, the block X has generation stamp GS1 (because dn1 has not called
commitSynchronization yet).
 
Therefore, when the crashed datanodes restart, the block at dn1 is invalid because 
there is no meta file. At dn3, the block file and meta file are finalized; however, 
the
block is corrupted because of a CRC mismatch. At dn2, the GS of the block is GS2,
which is not equal to the generation stamp of the block maintained in the 
NameNode.
Hence, the block blk_X is inaccessible.
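
The failed retry boils down to a stale generation stamp check; the following simplified sketch (hypothetical method and parameter names, not the real DataNode code) shows the shape of that comparison, assuming the surviving replica already carries GS2 while the client still presents GS1:

import java.io.IOException;

class GenerationStampSketch {
    // Reject recovery attempts that carry a generation stamp older than the one
    // already recorded for the replica (GS1 < GS2 in the scenario above).
    static void recoverBlock(long blockId, long clientGS, long replicaGS) throws IOException {
        if (clientGS < replicaGS) {
            throw new IOException("Stale generation stamp for blk_" + blockId
                    + ": client has " + clientGS + ", replica already has " + replicaGS);
        }
        // ... otherwise build the syncList, obtain a new stamp, and update the replicas ...
    }

    public static void main(String[] args) {
        try {
            recoverBlock(42L, /*clientGS=*/1L, /*replicaGS=*/2L);  // fails, as in the retry described above
        } catch (IOException e) {
            System.out.println("recoverBlock failed: " + e.getMessage());
        }
    }
}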

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1233) Bad retry logic at DFSClient

2010-06-16 Thread Thanh Do (JIRA)
Bad retry logic at DFSClient


 Key: HDFS-1233
 URL: https://issues.apache.org/jira/browse/HDFS-1233
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: failover bug, bad retry logic at DFSClient, cannot failover to the 
2nd disk
 
- Setups:
+ # available datanodes = 1
+ # disks / datanode = 2
+ # failures = 1
+ failure type = bad disk
+ When/where failure happens = (see below)
 
- Details:

The setup is:
1 datanode, 1 replica, and each datanode has 2 disks (Disk1 and Disk2).
 
We injected a single disk failure to see if we can failover to the
second disk or not.
 
If a persistent disk failure happens during createBlockOutputStream
(the first phase of the pipeline creation) (e.g. say DN1-Disk1 is bad),
then createBlockOutputStream (cbos) will get an exception and it
will retry!  When it retries it will get the same DN1 from the namenode,
and then DN1 will call DN.writeBlock(), FSVolume.createTmpFile,
and finally getNextVolume(), which uses a moving volume #.  Thus, on the
second try, the write will successfully go to the second disk.
So essentially createBlockOutputStream is wrapped in a
do/while(retry && --count >= 0). The first cbos will fail, the second
will be successful in this particular scenario.
 
NOW, say cbos is successful, but the failure is persistent.
Then the retry is in a different while loop.
First, hasError is set to true in RP.run (the ResponseProcessor).
Thus, DataStreamer.run() will go back to the loop:
while (!closed && clientRunning && !lastPacketInBlock).
Now this second iteration of the loop will call
processDatanodeError because hasError has been set to true.
In processDatanodeError (pde), the client sees that this is the only datanode
in the pipeline, and hence it considers that the node is bad! Although actually
only 1 disk is bad!  Hence, pde throws an IOException suggesting that
all the datanodes (in this case, only DN1) in the pipeline are bad.
Hence, in this error, the exception is thrown to the client.
But if the exception were, say, caught by the outermost
do/while(retry && --count >= 0) loop, then this outer retry would be
successful (as suggested in the previous paragraph).
 
In summary, if in a deployment scenario, we only have one datanode
that has multiple disks, and one disk goes bad, then the current
retry logic at the DFSClient side is not robust enough to mask the
failure from the client.
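
Here is a small, self-contained sketch (hypothetical helpers, not the real DFSClient or FSVolumeSet code) of the two retry shapes discussed above: the outer do/while around createBlockOutputStream masks a single bad disk because the round-robin volume choice lands on the healthy disk on the second try, while the later pipeline-error path has no equivalent second chance:

class SingleNodeRetrySketch {
    static int volumeIndex = 0;
    static final boolean[] volumeBad = { true, false };   // Disk1 is bad, Disk2 is healthy

    // Round-robin volume choice, in the spirit of the getNextVolume() behaviour described above.
    static boolean createBlockOutputStream() {
        int v = volumeIndex;
        volumeIndex = (volumeIndex + 1) % volumeBad.length;
        return !volumeBad[v];                              // fails on Disk1, succeeds on Disk2
    }

    public static void main(String[] args) {
        int count = 3;
        boolean success;
        // The outer retry loop: do { ... } while (retry && --count >= 0)
        do {
            success = createBlockOutputStream();
            System.out.println("createBlockOutputStream -> " + success);
        } while (!success && --count >= 0);
        // If the failure instead surfaces later, during the transfer phase,
        // processDatanodeError sees a one-node pipeline, declares the node bad,
        // and throws -- the client never gets back to this masking retry.
    }
}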

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1232) Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut

2010-06-16 Thread Thanh Do (JIRA)
Corrupted block if a crash happens before writing to checksumOut but after 
writing to dataOut
-

 Key: HDFS-1232
 URL: https://issues.apache.org/jira/browse/HDFS-1232
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: block is corrupted if a crash happens before writing to checksumOut 
but
after writing to dataOut. 
 
- Setup:
+ # available datanodes = 1
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = (see below)
 
- Details:
The order of processing a packet during a client write/append at a datanode
is: first forward the packet downstream, then write the data to the block file, 
and finally write to the checksum file. Hence if a crash happens BEFORE the 
write
to the checksum file but AFTER the write to the data file, the block is corrupted.
Worse, if this is the only available replica, the block is lost.
 
We also found this problem in the case where there are 3 replicas for a particular block
and, during append, there are two failures (see HDFS-1231).
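
The corruption window can be demonstrated with plain CRC32 (standing in for Hadoop's checksum stream, which it is not); the sketch below simulates a crash after the data file is extended but before the checksum file is updated:

import java.util.zip.CRC32;

class WriteOrderingSketch {
    static long crc(byte[] buf, int len) {
        CRC32 c = new CRC32();
        c.update(buf, 0, len);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] blockFile = new byte[0];
        long storedChecksum = crc(blockFile, 0);       // checksum file content before the packet

        // Step 1: the packet is appended to the data file (dataOut) ...
        blockFile = new byte[512];
        // Step 2 would update the checksum file (checksumOut) -- simulate a crash here.

        boolean corrupt = storedChecksum != crc(blockFile, blockFile.length);
        System.out.println("replica corrupt after crash: " + corrupt);
    }
}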

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive

2010-06-16 Thread Thanh Do (JIRA)
Datanode 'alive' but with its disk failed, Namenode thinks it's alive
-

 Key: HDFS-1234
 URL: https://issues.apache.org/jira/browse/HDFS-1234
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: Datanode 'alive' but with its disk failed, Namenode still thinks 
it's alive
 
- Setups:
+ Replication = 1
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ Failure type = bad disk
+ When/where failure happens = first phase of the pipeline
 
- Details:
In this experiment we have two datanodes. Each node has 1 disk.
However, if one datanode has a failed disk (but the node is still alive), the 
datanode
does not keep track of this.  From the perspective of the namenode,
that datanode is still alive, and thus the namenode gives back the same datanode
to the client.  The client will retry 3 times by asking the namenode to
give a new set of datanodes, and always get the same datanode.
And every time the client wants to write there, it gets an exception.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1235) Namenode returning the same Datanode to client, due to infrequent heartbeat

2010-06-16 Thread Thanh Do (JIRA)
Namenode returning the same Datanode to client, due to infrequent heartbeat
---

 Key: HDFS-1235
 URL: https://issues.apache.org/jira/browse/HDFS-1235
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Thanh Do


This bug has been reported.
Basically, since the datanode's heartbeat messages are infrequent (~ every 10 
minutes),
the NameNode always gives the client the same datanode even if that datanode is dead.
 
We want to point out that the client waits 6 seconds before retrying,
which amounts to long and useless retries in this scenario,
because in 6 secs the namenode hasn't declared the datanode dead.

Overall this happens when a datanode is dead during the first phase of the 
pipeline (file setups).
If a datanode is dead during the second phase (byte transfer), the DFSClient 
still
could proceed with the other surviving datanodes (which is consistent with what
Hadoop books always say -- the write should proceed if we have at least one good
datanode).  But unfortunately this specification is not true during the first 
phase of the
pipeline.  Overall we suggest that the namenode take into consideration the 
client's
view of unreachable datanodes.  That is, if a client says that it cannot reach 
DN-X,
then the namenode might give the client a node other than X (but the 
namenode
does not have to declare X dead). 

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1234:
---

Description: 
- Summary: Datanode 'alive' but with its disk failed, Namenode still thinks 
it's alive
 
- Setups:
+ Replication = 1
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ Failure type = bad disk
+ When/where failure happens = first phase of the pipeline
 
- Details:
In this experiment we have two datanodes. Each node has 1 disk.
However, if one datanode has a failed disk (but the node is still alive), the 
datanode
does not keep track of this.  From the perspective of the namenode,
that datanode is still alive, and thus the namenode gives back the same datanode
to the client.  The client will retry 3 times by asking the namenode to
give a new set of datanodes, and always get the same datanode.
And every time the client wants to write there, it gets an exception.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)

  was:
- Summary: Datanode 'alive' but with its disk failed, Namenode still thinks 
it's alive
 
- Setups:
+ Replication = 1
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ Failure type = bad disk
+ When/where failure happens = first phase of the pipeline
 
- Details:
In this experiment we have two datanodes. Each node has 1 disk.
However, if one datanode has a failed disk (but the node is still alive), the 
datanode
does not keep track of this.  From the perspective of the namenode,
that datanode is still alive, and thus the namenode gives back the same datanode
to the client.  The client will retry 3 times by asking the namenode to
give a new set of datanodes, and always get the same datanode.
And every time the client wants to write there, it gets an exception.


 Datanode 'alive' but with its disk failed, Namenode thinks it's alive
 -

 Key: HDFS-1234
 URL: https://issues.apache.org/jira/browse/HDFS-1234
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do

 - Summary: Datanode 'alive' but with its disk failed, Namenode still thinks 
 it's alive
  
 - Setups:
 + Replication = 1
 + # available datanodes = 2
 + # disks / datanode = 1
 + # failures = 1
 + Failure type = bad disk
 + When/where failure happens = first phase of the pipeline
  
 - Details:
 In this experiment we have two datanodes. Each node has 1 disk.
 However, if one datanode has a failed disk (but the node is still alive), the 
 datanode
 does not keep track of this.  From the perspective of the namenode,
 that datanode is still alive, and thus the namenode gives back the same 
 datanode
 to the client.  The client will retry 3 times by asking the namenode to
 give a new set of datanodes, and always get the same datanode.
 And every time the client wants to write there, it gets an exception.
 This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
 Haryadi Gunawi (hary...@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1236) Client uselessly retries recoverBlock 5 times

2010-06-16 Thread Thanh Do (JIRA)
Client uselessly retries recoverBlock 5 times
-

 Key: HDFS-1236
 URL: https://issues.apache.org/jira/browse/HDFS-1236
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Thanh Do


Summary:
Client uselessly retries recoverBlock 5 times
The same behavior is also seen in append protocol (HDFS-1229)

The setup:
# available datanodes = 4
Replication factor = 2 (hence there are 2 datanodes in the pipeline)
Failure type = Bad disk at datanode (not crashes)
# failures = 2
# disks / datanode = 1
Where/when the failures happen: This is a scenario where each disk of the two 
datanodes in the pipeline go bad at the same time during the 2nd phase of the 
pipeline (the data transfer phase).
 
Details:
 
In this case, the client will call processDatanodeError,
which will call datanode.recoverBlock on those two datanodes.
But since these two datanodes have bad disks (although they're still alive),
recoverBlock() will fail.
For this one, the client's retry logic ends when the streamer is closed (closed == 
true).
But before this happens, the client will retry 5 times
(maxRecoveryErrorCount) and will fail every time, until
it finishes.  What is interesting is that
during each retry there is a wait of 1 second in
DataStreamer.run (i.e. dataQueue.wait(1000)).
So it will be a 5-second total wait before declaring failure.
 
This is a different bug than HDFS-1235, where the client retries
3 times for 6 seconds (resulting in 25 seconds wait time).
In this experiment, what we get for the total wait time is only
12 seconds (not sure why it is 12). So the DFSClient quits without
contacting the namenode again (say to ask for a new set of
two datanodes).
So, interestingly, we find another
bug showing that the client retry logic is complex and non-deterministic,
depending on where and when failures happen.
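
As a rough illustration of the retry accounting described above (hypothetical names and constants; the real DFSClient logic is more involved), the retries amount to something like:

class RecoverBlockRetrySketch {
    static final int maxRecoveryErrorCount = 5;

    static boolean recoverBlock() {
        return false;                         // both pipeline datanodes have bad disks
    }

    public static void main(String[] args) throws InterruptedException {
        final Object dataQueue = new Object();
        int recoveryErrorCount = 0;
        while (recoveryErrorCount < maxRecoveryErrorCount) {
            synchronized (dataQueue) {
                dataQueue.wait(1000);         // the per-retry one-second wait
            }
            if (recoverBlock()) {
                return;
            }
            recoveryErrorCount++;
        }
        // No new pipeline is ever requested from the namenode before giving up.
        System.out.println("giving up after " + recoveryErrorCount + " failed recoveries");
    }
}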

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1236) Client uselessly retries recoverBlock 5 times

2010-06-16 Thread Thanh Do (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thanh Do updated HDFS-1236:
---

Description: 
Summary:
Client uselessly retries recoverBlock 5 times
The same behavior is also seen in append protocol (HDFS-1229)

The setup:
# available datanodes = 4
Replication factor = 2 (hence there are 2 datanodes in the pipeline)
Failure type = Bad disk at datanode (not crashes)
# failures = 2
# disks / datanode = 1
Where/when the failures happen: This is a scenario where each disk of the two 
datanodes in the pipeline go bad at the same time during the 2nd phase of the 
pipeline (the data transfer phase).
 
Details:
 
In this case, the client will call processDatanodeError,
which will call datanode.recoverBlock on those two datanodes.
But since these two datanodes have bad disks (although they're still alive),
recoverBlock() will fail.
For this one, the client's retry logic ends when the streamer is closed (closed == 
true).
But before this happens, the client will retry 5 times
(maxRecoveryErrorCount) and will fail every time, until
it finishes.  What is interesting is that
during each retry there is a wait of 1 second in
DataStreamer.run (i.e. dataQueue.wait(1000)).
So it will be a 5-second total wait before declaring failure.
 
This is a different bug than HDFS-1235, where the client retries
3 times for 6 seconds (resulting in 25 seconds wait time).
In this experiment, what we get for the total wait time is only
12 seconds (not sure why it is 12). So the DFSClient quits without
contacting the namenode again (say to ask for a new set of
two datanodes).
So interestingly we find another
bug that shows client retry logic is complex and not deterministic
depending on where and when failures happen.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)

  was:
Summary:
Client uselessly retries recoverBlock 5 times
The same behavior is also seen in append protocol (HDFS-1229)

The setup:
# available datanodes = 4
Replication factor = 2 (hence there are 2 datanodes in the pipeline)
Failure type = Bad disk at datanode (not crashes)
# failures = 2
# disks / datanode = 1
Where/when the failures happen: This is a scenario where each disk of the two 
datanodes in the pipeline go bad at the same time during the 2nd phase of the 
pipeline (the data transfer phase).
 
Details:
 
In this case, the client will call processDatanodeError
which will call datanode.recoverBlock to those two datanodes.
But since these two datanodes have bad disks (although they're still alive),
then recoverBlock() will fail.
For this one, the client's retry logic ends when streamer is closed (close == 
true).
But before this happen, the client will retry 5 times
(maxRecoveryErrorCount) and will fail all the time, until
it finishes.  What is interesting is that
during each retry, there is a wait of 1 second in
DataStreamer.run (i.e. dataQueue.wait(1000)).
So it will be a 5-second total wait before declaring it fails.
 
This is a different bug than HDFS-1235, where the client retries
3 times for 6 seconds (resulting in 25 seconds wait time).
In this experiment, what we get for the total wait time is only
12 seconds (not sure why it is 12). So the DFSClient quits without
contacting the namenode again (say to ask for a new set of
two datanodes).
So interestingly we find another
bug that shows client retry logic is complex and not deterministic
depending on where and when failures happen.

Component/s: hdfs client

 Client uselessly retries recoverBlock 5 times
 -

 Key: HDFS-1236
 URL: https://issues.apache.org/jira/browse/HDFS-1236
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do

 Summary:
 Client uselessly retries recoverBlock 5 times
 The same behavior is also seen in append protocol (HDFS-1229)
 The setup:
 # available datanodes = 4
 Replication factor = 2 (hence there are 2 datanodes in the pipeline)
 Failure type = Bad disk at datanode (not crashes)
 # failures = 2
 # disks / datanode = 1
 Where/when the failures happen: This is a scenario where each disk of the two 
 datanodes in the pipeline go bad at the same time during the 2nd phase of the 
 pipeline (the data transfer phase).
  
 Details:
  
 In this case, the client will call processDatanodeError
 which will call datanode.recoverBlock to those two datanodes.
 But since these two datanodes have bad disks (although they're still alive),
 then recoverBlock() will fail.
 For this one, the client's retry logic ends when streamer is closed (close == 
 true).
 But before this happen, the client will retry 5 times
 

[jira] Created: (HDFS-1238) A block is stuck in ongoingRecovery due to exception not propagated

2010-06-16 Thread Thanh Do (JIRA)
A block is stuck in ongoingRecovery due to exception not propagated 


 Key: HDFS-1238
 URL: https://issues.apache.org/jira/browse/HDFS-1238
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Setup:
+  # datanodes = 2
+ replication factor = 2
+ failure type = transient (i.e. a Java I/O call that throws an IOException or 
returns false)
+ # failures = 2
+ When/where failures happen: (This is a subtle bug) The first failure is a 
transient failure at a datanode during the second phase. Due to the first 
failure, the DFSClient will call recoverBlock.  The second failure is injected 
during this recover block process (i.e. another failure during the recovery 
process).
 
- Details:
 
The expectation here is that since the DFSClient performs lots of retries,
two transient failures should be masked properly by the retries.
We found one case where the failures are not transparent to the users.
 
Here's the stack trace of when/where the two failures happen (please ignore the 
line numbers).
 
1. The first failure:
Exception is thrown at
call(void java.io.DataOutputStream.flush())
SourceLoc: org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java(252)
Stack Trace:
  [0] datanode.BlockReceiver (flush:252)
  [1] datanode.BlockReceiver (receivePacket:660)
  [2] datanode.BlockReceiver (receiveBlock:743)
  [3] datanode.DataXceiver (writeBlock:468)
  [4] datanode.DataXceiver (run:119)
 
2. The second failure:
False is returned at
   call(boolean java.io.File.renameTo(File))
   SourceLoc: org/apache/hadoop/hdfs/server/datanode/FSDataset.java(105)
Stack Trace:
  [0] datanode.FSDataset (tryUpdateBlock:1008)
  [1] datanode.FSDataset (updateBlock:859)
  [2] datanode.DataNode (updateBlock:1780)
  [3] datanode.DataNode (syncBlock:2032)
  [4] datanode.DataNode (recoverBlock:1962)
  [5] datanode.DataNode (recoverBlock:2101)
 
This is what we found out:
The first failure causes the DFSClient to somehow call recoverBlock,
which forces us to see the 2nd failure. The 2nd failure makes
renameTo return false, which then causes an IOException to be thrown
from the function that calls renameTo.
But this IOException is not propagated properly!
It is dropped inside DN.syncBlock(). Specifically DN.syncBlock
calls DN.updateBlock() which gets the exception. But syncBlock
only catches that and prints a warning without propagating the exception
properly.  Thus syncBlock returns without any exception,
and thus recoverBlock returns without executing the finally{} block
(see below).
 
Now, the client will retry recoverBlock 3-5 more times,
but these retries always see exceptions! The reason is that the first
time we call recoverBlock(blk), this blk is put into
an ongoingRecovery list inside DN.recoverBlock().
Normally, blk is only removed (ongoingRecovery.remove(blk)) inside the 
finally{} block.
But since the exception is not propagated properly, this finally{}
block is never called; thus the blk is stuck
forever inside the ongoingRecovery list, and hence the next time the
client performs the retry, it gets the error message
"Block ... is already being recovered" and recoverBlock() throws an
IOException.  As a result, the client, which calls this whole
process in the context of processDatanodeError, will return
from the pde function with closed = true, and hence it never
retries the whole thing again from the beginning, and instead
just returns an error.
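
A condensed, hypothetical sketch of the effect described above (not the actual DataNode code): the exception from updateBlock is swallowed in syncBlock, the block is never removed from ongoingRecovery, and every later retry hits "already being recovered":

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class OngoingRecoverySketch {
    static final Set<Long> ongoingRecovery = new HashSet<>();

    static void updateBlock() throws IOException {
        throw new IOException("renameTo returned false");    // the injected second failure
    }

    static void syncBlock() {
        try {
            updateBlock();
        } catch (IOException e) {
            System.out.println("WARN: " + e.getMessage());    // swallowed, not rethrown
        }
    }

    static void recoverBlock(long blockId) throws IOException {
        if (!ongoingRecovery.add(blockId)) {
            // What the client's retries keep hitting in the reported scenario.
            throw new IOException("Block blk_" + blockId + " is already being recovered");
        }
        syncBlock();
        // Per the report, the cleanup that would normally run in a finally{}
        // block is skipped on this path, so blockId is never removed here.
    }

    public static void main(String[] args) {
        try {
            recoverBlock(42L);   // failure is swallowed; blk_42 stays in ongoingRecovery
            recoverBlock(42L);   // the retry
        } catch (IOException e) {
            System.out.println("retry failed: " + e.getMessage());
        }
    }
}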


This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.