HDFS VFS Driver

2010-06-16 Thread Michael D'Amour
We have an open source ETL tool (Kettle) which uses VFS for many
input/output steps/jobs.  We would like to be able to read/write HDFS
from Kettle using VFS.  
 
I haven't been able to find anything out there other than comments
that it would be nice to have.
 
I had some time a few weeks ago to begin writing a VFS driver for HDFS
and we (Pentaho) would like to be able to contribute this driver.  I
believe it supports all the major file/folder operations and I have
written unit tests for all of these operations.  The code is currently
checked into an open Pentaho SVN repository under the Apache 2.0
license.  There are some current limitations, such as a lack of
Kerberos authentication, which appears to be coming in 0.22.0; the
driver supports username/password, but I just can't use them yet.
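
To give a sense of how the driver plugs into Commons VFS, here is a
minimal usage sketch. HdfsFileProvider and the hdfs:// URL below are
illustrative placeholders standing in for the classes in the repository,
not the code as committed:

import org.apache.commons.vfs.FileObject;
import org.apache.commons.vfs.FileSystemException;
import org.apache.commons.vfs.impl.DefaultFileSystemManager;

public class HdfsVfsExample {
    public static void main(String[] args) throws FileSystemException {
        DefaultFileSystemManager fsm = new DefaultFileSystemManager();
        // Register the HDFS provider (hypothetical class name) for the "hdfs" scheme.
        fsm.addProvider("hdfs", new HdfsFileProvider());
        fsm.init();

        // Read a file from HDFS through the ordinary VFS API.
        FileObject file = fsm.resolveFile("hdfs://namenode:9000/user/mike/input.txt");
        System.out.println("exists = " + file.exists()
                + ", size = " + file.getContent().getSize());
        fsm.close();
    }
}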
 
Please let me know how to proceed with the contribution process.
 
Thank you.
-Mike


Re: HDFS VFS Driver

2010-06-16 Thread Arun C Murthy

Michael,

Please open a jira (new feature) and attach your patch there:
http://wiki.apache.org/hadoop/HowToContribute

thanks,
Arun

On Jun 16, 2010, at 8:55 AM, Michael D'Amour wrote:






Re: HDFS VFS Driver

2010-06-16 Thread Dhruba Borthakur
hi mike,

it would be nice to get a high-level doc on what it does and how it is implemented.

also, you might want to compare it with fuse-dfs:
http://wiki.apache.org/hadoop/MountableHDFS

thanks,
dhruba


On Wed, Jun 16, 2010 at 8:55 AM, Michael D'Amour mdam...@pentaho.com wrote:





-- 
Connect to me at http://www.facebook.com/dhruba


[jira] Created: (HDFS-1213) Implement a VFS Driver for HDFS

2010-06-16 Thread Michael D'Amour (JIRA)
Implement a VFS Driver for HDFS
---

 Key: HDFS-1213
 URL: https://issues.apache.org/jira/browse/HDFS-1213
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Michael D'Amour


We have an open source ETL tool (Kettle) which uses VFS for many input/output 
steps/jobs.  We would like to be able to read/write HDFS from Kettle using VFS. 
 
 
I haven't been able to find anything out there other than comments that it would be nice to have.
 
I had some time a few weeks ago to begin writing a VFS driver for HDFS and we 
(Pentaho) would like to be able to contribute this driver.  I believe it 
supports all the major file/folder operations and I have written unit tests for 
all of these operations.  The code is currently checked into an open Pentaho 
SVN repository under the Apache 2.0 license.  There are some current 
limitations, such as a lack of Kerberos authentication, which appears to be 
coming in 0.22.0; the driver supports username/password, but I just can't use 
them yet.

I will be attaching the code for the driver once the case is created.  The 
project does not modify existing hadoop/hdfs source.

Our JIRA case can be found at http://jira.pentaho.com/browse/PDI-4146




[jira] Created: (HDFS-1214) hdfs client metadata cache

2010-06-16 Thread Joydeep Sen Sarma (JIRA)
hdfs client metadata cache
--

 Key: HDFS-1214
 URL: https://issues.apache.org/jira/browse/HDFS-1214
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs client
Reporter: Joydeep Sen Sarma


In some applications, latency is affected by the cost of making RPC calls to the 
namenode to fetch metadata. The most obvious case is calls to fetch 
file/directory status. Applications like Hive like to make optimizations based 
on file size/number etc., and for such optimizations 'recent' status data 
(as opposed to the most up-to-date) is acceptable. In such cases, a cache on the 
DFS client that transparently caches metadata would greatly benefit 
applications.
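
As a rough illustration of the idea (not a proposed patch), a TTL-based
wrapper around FileSystem.getFileStatus could look like the sketch below;
the class name and the TTL policy are assumptions made for the example:

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical TTL cache in front of FileSystem.getFileStatus(). */
public class CachingStatusFetcher {
    private static class Entry {
        final FileStatus status;
        final long fetchedAt;
        Entry(FileStatus status, long fetchedAt) {
            this.status = status;
            this.fetchedAt = fetchedAt;
        }
    }

    private final FileSystem fs;
    private final long ttlMillis;
    private final Map<Path, Entry> cache = new ConcurrentHashMap<Path, Entry>();

    public CachingStatusFetcher(FileSystem fs, long ttlMillis) {
        this.fs = fs;
        this.ttlMillis = ttlMillis;
    }

    /** Returns possibly 'recent' (stale) status, refreshing only after the TTL expires. */
    public FileStatus getStatus(Path path) throws IOException {
        long now = System.currentTimeMillis();
        Entry e = cache.get(path);
        if (e != null && now - e.fetchedAt < ttlMillis) {
            return e.status;                            // no RPC to the namenode
        }
        FileStatus fresh = fs.getFileStatus(path);      // one RPC to the namenode
        cache.put(path, new Entry(fresh, now));
        return fresh;
    }
}

A real implementation would also need invalidation for paths the same client
writes or deletes.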





[jira] Resolved: (HDFS-1215) TestNodeCount infinite loops on branch-20-append

2010-06-16 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1215.
---

  Assignee: Todd Lipcon
Resolution: Fixed

Dhruba committed to 20-append branch

 TestNodeCount infinite loops on branch-20-append
 

 Key: HDFS-1215
 URL: https://issues.apache.org/jira/browse/HDFS-1215
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: 
 0025-Fix-TestNodeCount-to-not-infinite-loop-after-HDFS-40.patch


 HDFS-409 made some minicluster changes, which got incorporated into one of 
 the earlier 20-append patches. This breaks TestNodeCount so it infinite loops 
 on the branch. This patch fixes it.




[jira] Resolved: (HDFS-1216) Update to JUnit 4 in branch 20 append

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1216.


Resolution: Fixed

I just committed this. Thanks Todd!

 Update to JUnit 4 in branch 20 append
 -

 Key: HDFS-1216
 URL: https://issues.apache.org/jira/browse/HDFS-1216
 Project: Hadoop HDFS
  Issue Type: Task
  Components: test
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: junit-4.5.txt


 A lot of the append tests are JUnit 4 style. We should upgrade in the branch - 
 JUnit 4 is entirely backward compatible.




[jira] Created: (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization

2010-06-16 Thread Todd Lipcon (JIRA)
20 append: Blocks recovered on startup should be treated with lower priority 
during block synchronization
-

 Key: HDFS-1218
 URL: https://issues.apache.org/jira/browse/HDFS-1218
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.20-append


When a datanode experiences power loss, it can come back up with truncated 
replicas (due to local FS journal replay). Those replicas should not be allowed 
to truncate the block during block synchronization if there are other replicas 
from DNs that have _not_ restarted.




[jira] Resolved: (HDFS-142) In 0.20, move blocks being written into a blocksBeingWritten directory

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-142.
---

Resolution: Fixed

I have committed this. Thanks Sam, Nicolas and Todd.

 In 0.20, move blocks being written into a blocksBeingWritten directory
 --

 Key: HDFS-142
 URL: https://issues.apache.org/jira/browse/HDFS-142
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Raghu Angadi
Assignee: dhruba borthakur
Priority: Blocker
 Fix For: 0.20-append

 Attachments: appendFile-recheck-lease.txt, appendQuestions.txt, 
 deleteTmp.patch, deleteTmp2.patch, deleteTmp5_20.txt, deleteTmp5_20.txt, 
 deleteTmp_0.18.patch, dont-recover-rwr-when-rbw-available.txt, 
 handleTmp1.patch, hdfs-142-commitBlockSynchronization-unknown-datanode.txt, 
 HDFS-142-deaddn-fix.patch, HDFS-142-finalize-fix.txt, 
 hdfs-142-minidfs-fix-from-409.txt, 
 HDFS-142-multiple-blocks-datanode-exception.patch, 
 hdfs-142-recovery-reassignment-and-bbw-cleanup.txt, hdfs-142-testcases.txt, 
 hdfs-142-testleaserecovery-fix.txt, HDFS-142_20-append2.patch, 
 HDFS-142_20.patch, recentInvalidateSets-assertion-fix.txt, 
 recover-rbw-v2.txt, testfileappend4-deaddn.txt, 
 validateBlockMetaData-synchronized.txt


 Before 0.18, when the Datanode restarts, it deletes files under the data-dir/tmp 
 directory, since these files are not valid anymore. But in 0.18 it moves these 
 files to the normal directory, incorrectly making them valid blocks. One of the 
 following would work:
 - remove the tmp files during upgrade, or
 - if the files under /tmp are in pre-18 format (i.e. no generation stamp), delete 
 them.
 Currently the effect of this bug is that these files end up failing block 
 verification and eventually get deleted, but they cause incorrect over-replication 
 at the namenode before that.
 Also it looks like our policy regarding treating files under tmp needs to be 
 defined better. Right now there are probably one or two more bugs with it. 
 Dhruba, please file them if you remember.




[jira] Resolved: (HDFS-1141) completeFile does not check lease ownership

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1141.


Resolution: Fixed

Pulled into hadoop-0.20-append

 completeFile does not check lease ownership
 ---

 Key: HDFS-1141
 URL: https://issues.apache.org/jira/browse/HDFS-1141
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker
 Fix For: 0.20-append, 0.22.0

 Attachments: hdfs-1141-branch20.txt, hdfs-1141.txt, hdfs-1141.txt


 completeFile should check that the caller still owns the lease of the file 
 that it's completing. This is for the 'testCompleteOtherLeaseHoldersFile' 
 case in HDFS-1139.




[jira] Resolved: (HDFS-1207) 0.20-append: stallReplicationWork should be volatile

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1207.


Fix Version/s: 0.20-append
   Resolution: Fixed

I just committed this. Thanks Todd!

 0.20-append: stallReplicationWork should be volatile
 

 Key: HDFS-1207
 URL: https://issues.apache.org/jira/browse/HDFS-1207
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Fix For: 0.20-append

 Attachments: hdfs-1207.txt


 The stallReplicationWork member in FSNamesystem is accessed by multiple 
 threads without synchronization, but isn't marked volatile. I believe this is 
 responsible for about a 1% failure rate on 
 TestFileAppend4.testAppendSyncChecksum* on my 8-core test boxes (looking at the 
 logs, I see replication happening even though we've supposedly disabled it).
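
 For context, the fix amounts to marking the flag volatile so that a write by
 one thread is visible to the replication thread; a simplified, standalone
 illustration (only the field name matches the real code, everything else is
 invented for the sketch):

 /** Simplified illustration of why the flag needs 'volatile': one thread
  *  toggles it, another polls it in a loop.  Without 'volatile' the reader
  *  may never observe the write. */
 public class ReplicationToggle {
     // In FSNamesystem this is stallReplicationWork; 'volatile' guarantees
     // cross-thread visibility without full synchronization.
     private volatile boolean stallReplicationWork = false;

     public void setStall(boolean stall) {
         stallReplicationWork = stall;
     }

     public void replicationLoop() throws InterruptedException {
         while (true) {
             if (!stallReplicationWork) {
                 // compute replication work here
             }
             Thread.sleep(100);
         }
     }
 }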




[jira] Resolved: (HDFS-1210) DFSClient should log exception when block recovery fails

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1210.


Fix Version/s: 0.20-append
   Resolution: Fixed

I just committed this. Thanks Todd.

 DFSClient should log exception when block recovery fails
 

 Key: HDFS-1210
 URL: https://issues.apache.org/jira/browse/HDFS-1210
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs client
Affects Versions: 0.20-append, 0.20.2
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Trivial
 Fix For: 0.20-append

 Attachments: hdfs-1210.txt


 Right now we just retry without necessarily showing the exception. It can be 
 useful to see what the error was that prevented the recovery RPC from 
 succeeding.
 (I believe this only applies in 0.20 style of block recovery)
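
 A standalone sketch of the intended behavior: log each failed recovery
 attempt, with its exception, before retrying. The class, interface, and
 method names are made up for the illustration; this is not the DFSClient code.

 import java.io.IOException;
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;

 /** Illustrative retry loop that surfaces each recovery failure. */
 public class RecoveryRetryExample {
     private static final Log LOG = LogFactory.getLog(RecoveryRetryExample.class);

     interface RecoveryCall {
         void run() throws IOException;
     }

     static void retryRecovery(RecoveryCall call, int attempts) throws IOException {
         for (int i = 1; i <= attempts; i++) {
             try {
                 call.run();
                 return;
             } catch (IOException e) {
                 // The point of this issue: show why the recovery RPC failed.
                 LOG.warn("Block recovery attempt " + i + " of " + attempts + " failed", e);
                 if (i == attempts) {
                     throw e;
                 }
             }
         }
     }
 }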




[jira] Resolved: (HDFS-1211) 0.20 append: Block receiver should not log rewind packets at INFO level

2010-06-16 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur resolved HDFS-1211.


Resolution: Fixed

I just committed this. Thanks Todd!

 0.20 append: Block receiver should not log rewind packets at INFO level
 -

 Key: HDFS-1211
 URL: https://issues.apache.org/jira/browse/HDFS-1211
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
 Fix For: 0.20-append

 Attachments: hdfs-1211.txt


 In the 0.20 append implementation, it logs an INFO level message for every 
 packet that rewinds the end of the block file. This is really noisy for 
 applications like HBase which sync every edit.




[jira] Created: (HDFS-1219) Data Loss due to edits log truncation

2010-06-16 Thread Thanh Do (JIRA)
Data Loss due to edits log truncation
-

 Key: HDFS-1219
 URL: https://issues.apache.org/jira/browse/HDFS-1219
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.2
Reporter: Thanh Do


We found this problem at almost the same time as the HDFS developers.
Basically, the edits log is truncated before fsimage.ckpt is renamed to fsimage.
Hence, any crash that happens after the truncation but before the renaming will lead
to data loss. A detailed description can be found here:
https://issues.apache.org/jira/browse/HDFS-955
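
For illustration, the crash-safe ordering is to make the new image visible
first (an atomic rename) and truncate the edits log only afterwards. Below is
a standalone sketch of that ordering with made-up method names and simplified
error handling; it is not the FSImage code:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/** Sketch: publish the new fsimage before discarding the old edits. */
public class CheckpointSwap {
    static void swapCheckpoint(File fsimageCkpt, File fsimage, File edits) throws IOException {
        // 1. Atomically publish the new image (rename within one directory).
        if (!fsimageCkpt.renameTo(fsimage)) {
            throw new IOException("rename " + fsimageCkpt + " to " + fsimage + " failed");
        }
        // 2. Only now is it safe to discard the old edits: a crash before this
        //    point leaves the previous fsimage plus edits pair intact and replayable.
        FileOutputStream out = new FileOutputStream(edits); // opening truncates to zero length
        try {
            out.getChannel().force(true);
        } finally {
            out.close();
        }
    }
}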

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1221) NameNode unable to start due to stale edits log after a crash

2010-06-16 Thread Thanh Do (JIRA)
NameNode unable to start due to stale edits log after a crash
-

 Key: HDFS-1221
 URL: https://issues.apache.org/jira/browse/HDFS-1221
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: 
If a crash happens during FSEditLog.createEditLogFile(), the
edits log file on disk may be stale. During the next reboot, the NameNode 
will get an exception when parsing the edits file because of the stale data, 
leading to an unsuccessful reboot.
Note: this is just one example. Since the edits log (and fsimage)
carry no checksums, they are vulnerable to corruption as well.
 
- Details:
The steps to create a new edits log (which we infer from the HDFS code) are:
1) truncate the file to zero size
2) write FSConstants.LAYOUT_VERSION to a buffer
3) insert the end-of-file marker OP_INVALID at the end of the buffer
4) preallocate 1MB of data, filled with 0
5) flush the buffer to disk
 
Note that only in steps 1, 4, and 5 is the data on disk actually changed.
Now, suppose a crash happens after step 4 but before step 5.
On the next reboot, the NameNode will read this edits log file (which contains
all zeros). The first thing parsed is the LAYOUT_VERSION, which is 0. This is OK,
because the NameNode has code to handle that case
(though we expect LAYOUT_VERSION to be -18, don't we).
Next it parses the operation code, which also happens to be 0. Unfortunately, since 0
is the value of OP_ADD, the NameNode expects the parameters corresponding to
that operation. It then calls readString to read the path, which throws
an exception, leading to a failed reboot.
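
As an illustration of a cheap defense, the loader could refuse to interpret
opcodes when the header does not carry the expected layout version. The sketch
below is standalone; the expected version is passed in by the caller, whereas
the real loader would take it from FSConstants.LAYOUT_VERSION:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

/** Defensive header check when opening an edits file (illustrative only). */
public class EditsHeaderCheck {
    static void checkHeader(String editsPath, int expectedLayoutVersion) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(editsPath));
        try {
            int version = in.readInt();
            if (version != expectedLayoutVersion) {
                // A preallocated, zero-filled file shows up here as version 0,
                // which is how the stale log in this report slips past parsing.
                throw new IOException("Unexpected layout version " + version
                        + " in " + editsPath + ", expected " + expectedLayoutVersion);
            }
        } catch (EOFException e) {
            throw new IOException("Truncated edits log header in " + editsPath, e);
        } finally {
            in.close();
        }
    }
}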

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1225) Block lost when primary crashes in recoverBlock

2010-06-16 Thread Thanh Do (JIRA)
Block lost when primary crashes in recoverBlock
---

 Key: HDFS-1225
 URL: https://issues.apache.org/jira/browse/HDFS-1225
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: The block is lost if the primary datanode crashes in the middle of
tryUpdateBlock.
 
- Setup:
# available datanodes = 2
# replicas = 2
# disks / datanode = 1
# failures = 1
failure type = crash
When/where failure happens = (see below)
 
- Details:
Suppose we have 2 datanodes, dn1 and dn2, and dn1 is primary.
The client appends to blk_X_1001, and a crash happens during dn1.recoverBlock,
at the point after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.
Interestingly, in this case block X is eventually lost. Why?
After dn1.recoverBlock crashes at the rename, what is left in dn1's current
directory is:
1) blk_X
2) blk_X_1001.meta_tmp1002
== this is an invalid block, because it has no meta file associated with it.
dn2 (after dn1's crash) now contains:
1) blk_X
2) blk_X_1002.meta
(note that the rename at dn2 completed, because dn1 called dn2.updateBlock()
before calling its own updateBlock()).
But namenode.commitBlockSynchronization is never invoked, because dn1 has
crashed. Therefore, from the namenode's point of view, block X has GS 1001.
Hence, the block is lost.
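
As a side note, the 'invalid block' state above boils down to a block file
without a finalized meta companion; a tiny standalone check of that condition
(file-name handling is simplified and the class is invented for illustration,
this is not DataNode code):

import java.io.File;

/** A block file with no finalized ".meta" companion is unusable. */
public class BlockValidityCheck {
    static boolean hasFinalizedMeta(File dataDir, String blockName /* e.g. "blk_X" */) {
        File[] files = dataDir.listFiles();
        if (files == null) {
            return false;
        }
        for (File f : files) {
            String name = f.getName();
            // A finalized meta file looks like "blk_X_<genstamp>.meta";
            // "blk_X_1001.meta_tmp1002" (left behind by the crash) does not count.
            if (name.startsWith(blockName + "_") && name.endsWith(".meta")) {
                return true;
            }
        }
        return false;
    }
}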

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1227) UpdateBlock fails due to unmatched file length

2010-06-16 Thread Thanh Do (JIRA)
UpdateBlock fails due to unmatched file length
--

 Key: HDFS-1227
 URL: https://issues.apache.org/jira/browse/HDFS-1227
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: client append is not atomic; hence, it is possible that
when retrying during append, there is an exception in updateBlock
indicating an unmatched file length, making the append fail.
 
- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 2
+ failure type = bad disk
+ When/where failure happens = (see below)
+ This bug is non-deterministic; to reproduce it, add a sufficient sleep before 
out.write() in BlockReceiver.receivePacket() in dn1 and dn2 but not dn3
 
- Details:
Suppose the client appends 16 bytes to block X, which has length 16 bytes, at
dn1, dn2, and dn3.
Dn1 is primary. The pipeline is dn3-dn2-dn1. recoverBlock succeeds.
The client starts sending data to dn3, the first datanode in the pipeline.
dn3 forwards the packet to the downstream datanodes and starts writing
data to its disk. Suppose there is an exception in dn3 when writing to disk.
The client gets the exception and starts the recovery code by calling
dn1.recoverBlock() again.
dn1 in turn calls dn2.getMetadataInfo() and dn1.getMetaDataInfo() to build the
syncList.
Suppose that at the time getMetadataInfo() is called at both datanodes (dn1 and
dn2), the previous packet (which was sent from dn3) has not reached disk yet.
Hence, the block info given by getMetaDataInfo contains a length of 16 bytes.
But after that, the packet reaches disk, and the block file length becomes
32 bytes.
Using the syncList (which contains block info with a length of 16 bytes), dn1
calls updateBlock at dn2 and dn1, which will fail, because the length in the
new block info (given by updateBlock, which is 16 bytes) does not match the
actual length on disk (which is 32 bytes).
 
Note that this bug is non-deterministic. It depends on the thread interleaving
at the datanodes.
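
The failing check boils down to a length comparison of roughly this shape (a
standalone sketch with invented names, not the DataNode's updateBlock code):

import java.io.File;
import java.io.IOException;

/** Illustration of the length mismatch described above. */
public class UpdateBlockLengthCheck {
    static void verifyLengthBeforeUpdate(File blockFile, long lengthFromSyncList)
            throws IOException {
        long onDisk = blockFile.length();
        if (onDisk != lengthFromSyncList) {
            // In the scenario above: the syncList says 16 bytes, but the block
            // file has grown to 32 bytes, so the update refuses to proceed.
            throw new IOException("Block length mismatch for " + blockFile
                    + ": sync list says " + lengthFromSyncList
                    + " but on-disk length is " + onDisk);
        }
    }
}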

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)






[jira] Created: (HDFS-1228) CRC does not match when retrying appending a partial block

2010-06-16 Thread Thanh Do (JIRA)
CRC does not match when retrying appending a partial block
--

 Key: HDFS-1228
 URL: https://issues.apache.org/jira/browse/HDFS-1228
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: when appending to a partial block, it is possible that
a retry after an exception fails due to a checksum mismatch.
The append operation is not atomic (it should either complete or fail
completely, but does not).
 
- Setup:
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ failure type = bad disk
+ When/where failure happens = (see below)
 
- Details:
The client writes 16 bytes to dn1 and dn2. The write completes. So far so good.
The meta file now contains: 7 bytes of header + 4 bytes of checksum (CK1, the
checksum for the 16 bytes). The client then appends 16 more bytes, and assume
there is an exception at BlockReceiver.receivePacket() at dn2. So the client
knows dn2 is bad. BUT the append at dn1 is complete (i.e. the data portion and
the checksum portion have been written to disk in the corresponding block file
and meta file), meaning that the checksum file at dn1 now contains 7 bytes of
header + 4 bytes of checksum (CK2, the checksum for the 32 bytes of data).
Because dn2 had an exception, the client calls recoverBlock and starts the
append to dn1 again. dn1 receives 16 bytes of data and verifies whether the
pre-computed crc (CK2) matches what it recalculates now (CK1), which obviously
does not match. Hence an exception, and the retry fails.
 
- A similar bug has been reported at
https://issues.apache.org/jira/browse/HDFS-679
but here it manifests in a different context.
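
To see the mismatch concretely: a checksum over the first 16 bytes differs
from a checksum over the full 32 bytes. The standalone demo below uses
java.util.zip.CRC32 purely for illustration; HDFS's own checksum framing and
chunk size differ:

import java.util.Arrays;
import java.util.zip.CRC32;

/** CK1 (checksum of the first 16 bytes) vs CK2 (checksum of all 32 bytes). */
public class ChecksumMismatchDemo {
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] thirtyTwo = new byte[32];
        Arrays.fill(thirtyTwo, (byte) 7);
        byte[] firstSixteen = Arrays.copyOf(thirtyTwo, 16);

        long ck1 = crcOf(firstSixteen);   // what dn1's meta file held before the append
        long ck2 = crcOf(thirtyTwo);      // what dn1's meta file holds after the append
        System.out.println("CK1=" + ck1 + " CK2=" + ck2 + " equal=" + (ck1 == ck2));
    }
}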

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1230) BlocksMap.blockinfo is not getting cleared immediately after deleting a block.This will be cleared only after block report comes from the datanode.Why we need to maintain t

2010-06-16 Thread Gokul (JIRA)
BlocksMap.blockinfo is not getting cleared immediately after deleting a 
block.This will be cleared only after block report comes from the datanode.Why 
we need to maintain the blockinfo till that time.


 Key: HDFS-1230
 URL: https://issues.apache.org/jira/browse/HDFS-1230
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 0.20.1
Reporter: Gokul


BlocksMap.blockinfo is not getting cleared immediately after deleting a 
block. It will be cleared only after a block report comes from the datanode. Why 
do we need to maintain the blockinfo until that time? It increases namenode 
memory unnecessarily. 




[jira] Created: (HDFS-1231) Generation Stamp mismatches, leading to failed append

2010-06-16 Thread Thanh Do (JIRA)
Generation Stamp mismatches, leading to failed append
-

 Key: HDFS-1231
 URL: https://issues.apache.org/jira/browse/HDFS-1231
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: recoverBlock is not atomic, so a retry fails when it runs into
a failure.
 
- Setup:
+ # available datanodes = 3
+ # disks / datanode = 1
+ # failures = 2
+ failure type = crash
+ When/where failure happens = (see below)
 
- Details:
Suppose there are 3 datanodes in the pipeline: dn3, dn2, and dn1. Dn1 is
primary.
When appending, the client first calls dn1.recoverBlock to make all the
datanodes in the pipeline agree on the new generation stamp (GS1) and the
length of the block.
The client then sends a data packet to dn3. dn3 in turn forwards this packet to
the downstream datanodes (dn2 and dn1) and starts writing to its own disk, then
it crashes AFTER writing to the block file but BEFORE writing to the meta file.
The client notices the crash and calls dn1.recoverBlock().
dn1.recoverBlock() first creates a syncList (by calling getMetadataInfo at dn2
and dn1).
Then dn1 calls NameNode.getNextGS() to get a new generation stamp (GS2).
Then it calls dn2.updateBlock(), which returns successfully.
Now it starts calling its own updateBlock and crashes after renaming
blk_X_GS1.meta to blk_X_GS1.meta_tmpGS2.
Therefore, from the client's point of view, dn1.recoverBlock() has failed,
but the generation stamp for the block has already been incremented at the
namenode (GS2).
The client retries by calling dn2.recoverBlock with the old GS (GS1), which no
longer matches the GS handed out by the NameNode; the resulting exception makes
the append fail.
 
Now, after all this, we have
- in dn3 (which crashed):
tmp/blk_X
tmp/blk_X_GS1.meta
- in dn2:
current/blk_X
current/blk_X_GS2
- in dn1:
current/blk_X
current/blk_X_GS1.meta_tmpGS2
- in the NameNode, block X has generation stamp GS1 (because dn1 has not called
commitBlockSynchronization yet).
 
Therefore, when the crashed datanodes restart, the block at dn1 is invalid
because there is no meta file. In dn3, the block file and meta file are
finalized; however, the block is corrupted because of a CRC mismatch. In dn2,
the GS of the block is GS2, which does not equal the generation stamp for the
block maintained in the NameNode.
Hence, the block blk_X is inaccessible.
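
The rejected retry above comes down to a stale generation stamp comparison of
roughly this shape (a standalone illustration with invented names, not the
namenode code):

import java.io.IOException;

/** A recovery request carrying an old generation stamp is refused. */
public class GenerationStampCheck {
    static void checkGenerationStamp(long requestGS, long currentGS) throws IOException {
        if (requestGS < currentGS) {
            // GS1 arrives after the namenode has already handed out GS2.
            throw new IOException("Stale generation stamp " + requestGS
                    + ", block is already at " + currentGS);
        }
    }
}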

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1233) Bad retry logic at DFSClient

2010-06-16 Thread Thanh Do (JIRA)
Bad retry logic at DFSClient


 Key: HDFS-1233
 URL: https://issues.apache.org/jira/browse/HDFS-1233
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: failover bug; bad retry logic at the DFSClient means it cannot
fail over to the 2nd disk.
 
- Setup:
+ # available datanodes = 1
+ # disks / datanode = 2
+ # failures = 1
+ failure type = bad disk
+ When/where failure happens = (see below)
 
- Details:
 
The setup is:
1 datanode, 1 replica, and each datanode has 2 disks (Disk1 and Disk2).
 
We injected a single disk failure to see if we can fail over to the
second disk or not.
 
If a persistent disk failure happens during createBlockOutputStream
(the first phase of pipeline creation), e.g. say DN1-Disk1 is bad,
then createBlockOutputStream (cbos) will get an exception and it
will retry!  When it retries it will get the same DN1 from the namenode,
and then DN1 will call DN.writeBlock(), FSVolume.createTmpFile,
and finally getNextVolume(), which advances to the next volume.  Thus, on the
second try, the write will successfully go to the second disk.
So essentially createBlockOutputStream is wrapped in a
do/while (retry && --count >= 0). The first cbos will fail, the second
will be successful in this particular scenario.
 
NOW, say cbos is successful, but the failure is persistent.
Then the retry is in a different while loop.
First, hasError is set to true in RP.run (responder packet).
Thus, DataStreamer.run() will go back to the loop
while (!closed && clientRunning && !lastPacketInBlock).
The second iteration of this loop will call
processDatanodeError because hasError has been set to true.
In processDatanodeError (pde), the client sees that this is the only datanode
in the pipeline, and hence it considers the node bad, although actually
only one disk is bad!  Hence, pde throws an IOException suggesting that
all the datanodes in the pipeline (in this case, only DN1) are bad.
Hence, on this path, the exception is thrown to the client.
But if the exception is, say, caught by the outermost
do/while (retry && --count >= 0) loop, then that outer retry will
succeed (as suggested in the previous paragraph).
 
In summary, if in a deployment scenario we only have one datanode
that has multiple disks, and one disk goes bad, then the current
retry logic on the DFSClient side is not robust enough to mask the
failure from the client.
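
The outer retry shape referred to above looks roughly like the standalone
sketch below (the interface and all names are invented for illustration); it
succeeds on the second attempt only because the datanode rotates to its next
volume:

import java.io.IOException;

/** Bounded outer retry around pipeline setup, as described in the report. */
public class OuterRetrySketch {
    interface PipelineSetup {
        void createBlockOutputStream() throws IOException;
    }

    static void setupWithRetry(PipelineSetup setup, int count) throws IOException {
        boolean retry;
        do {
            retry = false;
            try {
                setup.createBlockOutputStream();
                return;
            } catch (IOException e) {
                retry = true;   // same datanode, next volume on the retry
            }
        } while (retry && --count >= 0);
        throw new IOException("Could not set up the write pipeline after retries");
    }
}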

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1232) Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut

2010-06-16 Thread Thanh Do (JIRA)
Corrupted block if a crash happens before writing to checksumOut but after 
writing to dataOut
-

 Key: HDFS-1232
 URL: https://issues.apache.org/jira/browse/HDFS-1232
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: the block is corrupted if a crash happens before the write to
checksumOut but after the write to dataOut.
 
- Setup:
+ # available datanodes = 1
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = (see below)
 
- Details:
The order of processing a packet during a client write/append at the datanode
is: first forward the packet downstream, then write the data to the block
file, and finally write to the checksum file. Hence, if a crash happens BEFORE
the write to the checksum file but AFTER the write to the data file, the block
is corrupted.
Worse, if this is the only available replica, the block is lost.
 
We also found this problem in the case where there are 3 replicas for a
particular block and, during append, there are two failures (see HDFS-1231).
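
The vulnerable ordering can be pictured with plain streams, as below. This is
only an illustration of the crash window, with simplified checksum framing; it
is not the BlockReceiver code:

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

/** Data is written and synced before its checksum; a crash in between
 *  leaves a chunk that can no longer be verified. */
public class PacketWriteOrder {
    static void writePacket(FileOutputStream dataOut, FileOutputStream checksumOut,
                            byte[] chunk) throws IOException {
        dataOut.write(chunk);                 // step 1: the block file grows
        dataOut.getFD().sync();
        // A crash here is the window in this report: the data is on disk,
        // but the matching checksum below never makes it.
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        int v = (int) crc.getValue();
        byte[] crcBytes = new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
        };
        checksumOut.write(crcBytes);          // step 2: the meta file is updated
        checksumOut.getFD().sync();
    }
}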

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1234) Datanode 'alive' but with its disk failed, Namenode thinks it's alive

2010-06-16 Thread Thanh Do (JIRA)
Datanode 'alive' but with its disk failed, Namenode thinks it's alive
-

 Key: HDFS-1234
 URL: https://issues.apache.org/jira/browse/HDFS-1234
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: a datanode is 'alive' but its disk has failed; the Namenode still
thinks it's alive.
 
- Setup:
+ Replication = 1
+ # available datanodes = 2
+ # disks / datanode = 1
+ # failures = 1
+ Failure type = bad disk
+ When/where failure happens = first phase of the pipeline
 
- Details:
In this experiment we have two datanodes. Each node has 1 disk.
However, if one datanode has a failed disk (but the node is still alive), the
datanode does not keep track of this.  From the perspective of the namenode,
that datanode is still alive, and thus the namenode gives back the same
datanode to the client.  The client will retry 3 times by asking the namenode
to give it a new set of datanodes, and always gets the same datanode.
And every time the client tries to write there, it gets an exception.
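
A sketch of the kind of local volume probe the report suggests is missing.
The probe-file approach and all names here are assumptions made for the
illustration, not the DataNode's actual disk checking:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/** Try a tiny write in a data directory and report the volume failed if it throws. */
public class DiskProbe {
    static boolean volumeLooksHealthy(File dataDir) {
        File probe = new File(dataDir, ".disk_probe");
        try {
            FileOutputStream out = new FileOutputStream(probe);
            try {
                out.write(1);
                out.getFD().sync();
            } finally {
                out.close();
            }
            return probe.delete();
        } catch (IOException e) {
            return false;   // a failed disk surfaces here as an IOException
        }
    }
}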




[jira] Created: (HDFS-1235) Namenode returning the same Datanode to client, due to infrequent heartbeat

2010-06-16 Thread Thanh Do (JIRA)
Namenode returning the same Datanode to client, due to infrequent heartbeat
---

 Key: HDFS-1235
 URL: https://issues.apache.org/jira/browse/HDFS-1235
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Reporter: Thanh Do


This bug has been reported before.
Basically, since a datanode's heartbeat messages are infrequent (~ every 10
minutes),
the NameNode keeps giving the client the same datanode even if that datanode
is dead.
 
We want to point out that the client waits 6 seconds before retrying,
which could be considered a long and useless retry in this scenario,
because within 6 seconds the namenode has not declared the datanode dead.
 
Overall this happens when a datanode dies during the first phase of the
pipeline (file setup).
If a datanode dies during the second phase (byte transfer), the DFSClient can
still proceed with the other surviving datanodes (which is consistent with
what the Hadoop books always say -- the write should proceed as long as at
least one good datanode remains).  But unfortunately this does not hold during
the first phase of the pipeline.  Overall we suggest that the namenode take
into consideration the client's view of unreachable datanodes.  That is, if a
client says that it cannot reach DN-X, then the namenode might give the client
a node other than X (but the namenode does not have to declare X dead). 
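
The suggestion amounts to the client keeping its own exclude list and
filtering namenode-provided targets against it. A self-contained sketch, with
plain strings standing in for datanode descriptors (all names are invented for
the illustration):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Client-side view of unreachable datanodes. */
public class ClientSideExcludeList {
    private final Set<String> unreachable = new HashSet<String>();

    public void markUnreachable(String datanode) {
        unreachable.add(datanode);
    }

    /** Filters a namenode-provided target list against the client's view. */
    public List<String> filterTargets(List<String> targetsFromNamenode) {
        List<String> usable = new ArrayList<String>();
        for (String dn : targetsFromNamenode) {
            if (!unreachable.contains(dn)) {
                usable.add(dn);
            }
        }
        return usable;
    }
}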

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)




[jira] Created: (HDFS-1236) Client uselessly retries recoverBlock 5 times

2010-06-16 Thread Thanh Do (JIRA)
Client uselessly retries recoverBlock 5 times
-

 Key: HDFS-1236
 URL: https://issues.apache.org/jira/browse/HDFS-1236
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.20.1
Reporter: Thanh Do


Summary:
The client uselessly retries recoverBlock 5 times.
The same behavior is also seen in the append protocol (HDFS-1229).

The setup:
# available datanodes = 4
Replication factor = 2 (hence there are 2 datanodes in the pipeline)
Failure type = bad disk at a datanode (not a crash)
# failures = 2
# disks / datanode = 1
Where/when the failures happen: this is a scenario where the disk of each of
the two datanodes in the pipeline goes bad at the same time during the 2nd
phase of the pipeline (the data transfer phase).
 
Details:
 
In this case, the client will call processDatanodeError,
which will call datanode.recoverBlock on those two datanodes.
But since these two datanodes have bad disks (although they're still alive),
recoverBlock() will fail.
Here the client's retry logic ends when the streamer is closed (closed ==
true).
But before this happens, the client will retry 5 times
(maxRecoveryErrorCount) and fail every time, until it finishes.
What is interesting is that during each retry there is a wait of 1 second in
DataStreamer.run (i.e. dataQueue.wait(1000)),
so there is a 5-second total wait before declaring failure.
 
This is a different bug than HDFS-1235, where the client retries
3 times for 6 seconds (resulting in 25 seconds of wait time).
In this experiment, the total wait time we observe is only
12 seconds (we are not sure why it is 12). So the DFSClient quits without
contacting the namenode again (say, to ask for a new set of
two datanodes).
So, interestingly, we find another bug showing that the client retry logic is
complex and not deterministic, depending on where and when failures happen.
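
The retry structure walked through above has roughly this shape (a standalone
sketch with invented names; the 1000 ms sleep mirrors dataQueue.wait(1000)).
The final comment marks where a fallback to the namenode could be added:

import java.io.IOException;

/** Up to maxRecoveryErrorCount local recovery attempts, then give up. */
public class RecoverBlockRetrySketch {
    interface Recovery {
        boolean recoverBlock() throws IOException;
    }

    static boolean recoverWithRetries(Recovery r, int maxRecoveryErrorCount)
            throws InterruptedException {
        for (int errors = 0; errors < maxRecoveryErrorCount; errors++) {
            Thread.sleep(1000);              // one second per attempt, as observed
            try {
                if (r.recoverBlock()) {
                    return true;
                }
            } catch (IOException e) {
                // both pipeline datanodes have bad disks, so every attempt fails
            }
        }
        // A fallback could go here: ask the namenode for a fresh set of datanodes
        // instead of failing outright.
        return false;
    }
}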
