[jira] Updated: (HDFS-1597) Batched edit log syncs can reset synctxid and throw assertions

2011-01-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1597:
--

Status: Patch Available  (was: Open)

> Batched edit log syncs can reset synctxid and throw assertions
> --
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.22.0
>
> Attachments: hdfs-1597.txt, illustrate-test-failure.txt
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already 
> batched in by another thread.
> This is related to a second bug in which the same case causes synctxid to be 
> reset to 0

-- 
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986861#action_12986861
 ] 

dhruba borthakur commented on HDFS-1595:


It appears that Todd's proposal could work well to avoid this issue. Do you 
agree, Nicholas?

It appears to me (please correct me if I am wrong) to be a data availability 
problem. The replica on F is actually still intact and the data is good there; 
it is just that clients are unable to read that data. Is it true that if the 
network card on F gets fixed, the data becomes available once again?

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take D out of the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we found a case where the faulty machine F is in fact S, not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> such that the interface works fine when sending out a small amount of data, 
> say 1MB, but fails when sending out a large amount of data, say 100MB.  
> Reading works fine for any data size.
> It is even worse if F is the first datanode in the pipeline.  Consider the 
> following:
> # DFSClient creates a pipeline with three datanodes.  The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F 
> reports that the second datanode has an error.
> # DFSClient removes the second datanode from the pipeline and continues 
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes, but steps 2 and 3 repeat.
> # Now, only F remains in the pipeline.  DFSClient continues writing with one 
> replica on F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under-replicated.  The NameNode schedules replication from F 
> to some other datanode D.
> # The replication fails for the same reason.  D reports to the NameNode that 
> the replica on F is corrupted.
> # The NameNode marks the replica on F as corrupted.
> # The block is corrupted since no valid replica is available.
> This is a *data loss* scenario.




[jira] Commented: (HDFS-1593) Allow a datanode to copy a block to a datanode on a foreign HDFS cluster.

2011-01-25 Thread Sanjay Radia (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986852#action_12986852
 ] 

Sanjay Radia commented on HDFS-1593:


In the case of the NN issuing a copy operation, the data does not go via the 
NN but is transferred directly; the two DNs do have to ensure that the other 
peer is indeed a DN in the cluster. I need to look at the code more closely, 
but I believe the DN generates an access token since it shares the access-token 
secret with the NN.

Dhruba, you are asserting that the two clusters have the same principals; while 
this may be true in many cases, it may not be true in all environments. 
Further, the secret that is used to generate access tokens is not the same in 
two different clusters (even if they use the same principals).

BTW, we have been looking at the same problem here at Yahoo and are trying to 
figure out the best secure solution.
There is another issue you are missing: the default block sizes on the two 
clusters may differ, so I am not sure one can simply copy a block across.
Q: are you trying to push or pull the data? To handle different block sizes it 
seems easier to pull the data. One choice is to access a byte range in a file 
via the DfsClient; the other is to get the block's bytes from multiple DNs in 
the remote src cluster.

I do, however, agree that transferring data directly from one or more datanodes 
to another is desirable.
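A small sketch of why pulling by byte range sidesteps the block-size mismatch: 
the destination plans its own blocks at its own block size and reads the file 
as a plain byte stream from the source cluster (e.g. via DfsClient positioned 
reads). RangePlanner below is an invented name for illustration, not anything 
in the HDFS-1593 patch.

```java
import java.util.ArrayList;
import java.util.List;

class RangePlanner {
    /**
     * Byte ranges [start, end) the destination must fetch, one per
     * destination-side block. Source block boundaries are irrelevant:
     * the source cluster serves the file as a contiguous byte stream.
     */
    static List<long[]> destBlockRanges(long fileLength, long destBlockSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long off = 0; off < fileLength; off += destBlockSize) {
            ranges.add(new long[] { off, Math.min(off + destBlockSize, fileLength) });
        }
        return ranges;
    }
}
```

With a push, by contrast, the sender would have to know the receiver's block 
size to split its blocks correctly, which is why pulling looks simpler here.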



> Allow a datanode to copy a block to a datanode on a foreign HDFS cluster.
> -
>
> Key: HDFS-1593
> URL: https://issues.apache.org/jira/browse/HDFS-1593
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: copyBlockTrunk1.txt
>
>
> This patch introduces an RPC to the datanode to allow it to copy a block to a 
> datanode on a remote HDFS cluster.




[jira] Commented: (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms

2011-01-25 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986815#action_12986815
 ] 

Jitendra Nath Pandey commented on HDFS-1580:


> Is HDFS-1557 almost ready to go? (should I take a look at it?) 
  Yes, almost! I am done with my review, and Suresh is also taking a look at the 
patch; once his review is done I will proceed to commit it. You are welcome to 
take a look.

> Add interface for generic Write Ahead Logging mechanisms
> 
>
> Key: HDFS-1580
> URL: https://issues.apache.org/jira/browse/HDFS-1580
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
> Attachments: generic_wal_iface.txt
>
>





[jira] Updated: (HDFS-1597) Batched edit log syncs can reset synctxid and throw assertions

2011-01-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1597:
--

Attachment: hdfs-1597.txt

Here's a patch containing a fix and also two new unit tests that verify the 
edit-batching behavior.





[jira] Updated: (HDFS-1597) Batched edit log syncs can reset synctxid and throw assertions

2011-01-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1597:
--

Description: 
The top of FSEditLog.logSync has the following assertion:
{code}
assert editStreams.size() > 0 : "no editlog streams";
{code}
which should actually come after checking to see if the sync was already 
batched in by another thread.

This is related to a second bug in which the same case causes synctxid to be 
reset to 0

  was:
The top of FSEditLog.logSync has the following assertion:
{code}
assert editStreams.size() > 0 : "no editlog streams";
{code}
which should actually come after checking to see if the sync was already 
batched in by another thread.

Will describe the race in a comment.

Summary: Batched edit log syncs can reset synctxid and throw assertions  
(was: Misplaced assertion in FSEditLog.logSync)

Updated the description to reflect the other problem as well.

> Batched edit log syncs can reset synctxid and throw assertions
> --
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.22.0
>
> Attachments: illustrate-test-failure.txt
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already 
> batched in by another thread.
> This is related to a second bug in which the same case causes synctxid to be 
> reset to 0




[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986788#action_12986788
 ] 

Koji Noguchi commented on HDFS-1595:


bq. So this faulty node F has no problem receiving large amount of data? 

This faulty node had problems sending/receiving large amounts of data, failing 
most of the time.
The bigger the data, the higher the chance of failure. I think smaller data 
(say, less than 1MB) was going through 99% of the time,
so heartbeats, acks and so forth were probably working.

When I tried to scp some blocks out from this node for data recovery, it kept 
on failing with 

===
blk_-1131935611740137990%   
 0 0.0KB/s   --:-- ETA
Corrupted MAC on input.
Finished discarding for aa.bb.cc.dd
lost connection

===

So I believe *most* of the dfsclient writes were failing when going through this 
node.
And when one successfully went through (after hundreds of write attempts for 
different blocks), it would then fail on all the following replications but 
succeed on 'close' with 1 replica, leading to this bug.






[jira] Commented: (HDFS-1158) HDFS-457 increases the chances of losing blocks

2011-01-25 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986783#action_12986783
 ] 

Eli Collins commented on HDFS-1158:
---

Good suggestion Owen. How about the following?

* A dn should decommission itself rather than shut down whenever (a) the 
configured threshold of disk failures has been reached or (b) a critical 
volume (specified in the config, e.g. the volume(s) that host the logs, pid 
files, tmp, etc.) has failed. In practice an admin would specify how many 
volume failures should be tolerated and mark the root volume as critical.

* The configured failed.volumes.tolerated should be respected on startup. The 
datanode should only refuse to start up if more than failed.volumes.tolerated 
volumes have failed, or if a configured critical volume has failed (which is 
probably not an issue in practice, since dn startup probably fails anyway, e.g. 
if the root volume has gone read-only). 
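The two bullets above can be sketched as a small policy object. This is a 
hypothetical illustration (the names VolumeFailurePolicy, Action, etc. are 
invented; the real datanode config key is spelled differently), showing only 
the decision logic: exceed the tolerated count or lose a critical volume, and 
the dn decommissions at runtime or refuses to start at startup.

```java
import java.util.Set;

enum Action { RUN, DECOMMISSION, REFUSE_STARTUP }

class VolumeFailurePolicy {
    private final int volFailuresTolerated;     // cf. failed.volumes.tolerated
    private final Set<String> criticalVolumes;  // e.g. the root volume

    VolumeFailurePolicy(int volFailuresTolerated, Set<String> criticalVolumes) {
        this.volFailuresTolerated = volFailuresTolerated;
        this.criticalVolumes = criticalVolumes;
    }

    private boolean overThreshold(Set<String> failedVolumes) {
        if (failedVolumes.size() > volFailuresTolerated) return true;
        for (String v : failedVolumes)
            if (criticalVolumes.contains(v)) return true;  // critical volume lost
        return false;
    }

    /** On startup: refuse to start rather than run with too little storage. */
    Action onStartup(Set<String> failedVolumes) {
        return overThreshold(failedVolumes) ? Action.REFUSE_STARTUP : Action.RUN;
    }

    /** At runtime: decommission (re-replicate, keep serving) rather than shut down. */
    Action onRuntimeFailure(Set<String> failedVolumes) {
        return overThreshold(failedVolumes) ? Action.DECOMMISSION : Action.RUN;
    }
}
```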


>  HDFS-457 increases the chances of losing blocks
> 
>
> Key: HDFS-1158
> URL: https://issues.apache.org/jira/browse/HDFS-1158
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.21.0
>Reporter: Koji Noguchi
> Attachments: rev-HDFS-457.patch
>
>
> Whenever we restart a cluster, there's a chance of losing some blocks if more 
> than three datanodes don't come up.
> HDFS-457 increases this chance by keeping the datanodes up even when 
># /tmp disk goes read-only
># /disk0 that is used for storing PID goes read-only 
> and probably more.
> In our environment, /tmp and /disk0 are from the same device.
> When trying to restart a datanode, it would fail with
> 1) 
> {noformat}
> 2010-05-15 05:45:45,575 WARN org.mortbay.log: tmpdir
> java.io.IOException: Read-only file system
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.checkAndCreate(File.java:1704)
> at java.io.File.createTempFile(File.java:1792)
> at java.io.File.createTempFile(File.java:1828)
> at 
> org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
> {noformat}
> or 
> 2) 
> {noformat}
> hadoop-daemon.sh: line 117: /disk/0/hadoop-datanodecom.out: Read-only 
> file system
> hadoop-daemon.sh: line 118: /disk/0/hadoop-datanode.pid: Read-only file system
> {noformat}
> I can recover the missing blocks but it takes some time.
> Also, we lose track of block movements, since the log directory can also go 
> read-only while the datanode continues running.
> For the 0.21 release, can we revert HDFS-457 or make it configurable?




[jira] Commented: (HDFS-1469) TestBlockTokenWithDFS fails on trunk

2011-01-25 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986781#action_12986781
 ] 

Konstantin Boudnik commented on HDFS-1469:
--

Forgot to mention that the timeout happens on a 0.20.2-based release.

> TestBlockTokenWithDFS fails on trunk
> 
>
> Key: HDFS-1469
> URL: https://issues.apache.org/jira/browse/HDFS-1469
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Priority: Blocker
> Attachments: failed-TestBlockTokenWithDFS.txt, log.gz
>
>
> TestBlockTokenWithDFS is failing on trunk:
> Testcase: testAppend took 31.569 sec
>   FAILED
> null
> junit.framework.AssertionFailedError: null
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestBlockTokenWithDFS.testAppend(TestBlockTokenWithDFS.java:223)




[jira] Updated: (HDFS-1469) TestBlockTokenWithDFS fails on trunk

2011-01-25 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-1469:
-

Attachment: log.gz

I ran a slightly modified test (converted to JUnit 4 with better assertion 
messages, etc.) with full test output turned on, in a loop, and got the timeout 
(log attached).





[jira] Commented: (HDFS-1597) Misplaced assertion in FSEditLog.logSync

2011-01-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986773#action_12986773
 ] 

Todd Lipcon commented on HDFS-1597:
---

bq. As of HDFS-119 synctxid is set in the finally block, even if the sync was 
batched

Sorry, I didn't say that very clearly. If the sync is batched, it will set 
{{synctxid}} to *0* in the {{finally}} block! So the next thread comes along, 
doesn't think it has been batched (though it has), and does yet another sync.

> Misplaced assertion in FSEditLog.logSync
> 
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.22.0
>
> Attachments: illustrate-test-failure.txt
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already 
> batched in by another thread.
> Will describe the race in a comment.




[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986774#action_12986774
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1595:
--

> Resurrecting the pipeline with more replicas is a nice idea but I imagine it 
> will be super-complicated, no?

Yes, it is complicated.  Nonetheless, it is invaluable for this JIRA, for 
append, and for other applications like the HBase use case you provided.





[jira] Commented: (HDFS-1597) Misplaced assertion in FSEditLog.logSync

2011-01-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986770#action_12986770
 ] 

Todd Lipcon commented on HDFS-1597:
---

Actually, in trunk there's a second bug that affects this area of the code. As 
of HDFS-119, {{synctxid}} is set in the {{finally}} block even if the sync was 
batched. So, in this case, if there are two threads acting like "Thread A" in my 
example, the second one will actually fall past the batching check and trigger 
the assertion later as well (or even lose edits!).






[jira] Updated: (HDFS-1597) Misplaced assertion in FSEditLog.logSync

2011-01-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-1597:
--

Attachment: illustrate-test-failure.txt

Here's a little hack I did that makes the test fail reliably with this error:

{noformat}
Caused by: java.lang.AssertionError: no editlog streams
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:485)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2071)
at org.apache.hadoop.hdfs.server.namenode.TestEditLogRace$Transactions.run(TestEditLogRace.java:115)
at java.lang.Thread.run(Thread.java:662)
{noformat}

(obviously not for commit, just to trigger the race)





[jira] Commented: (HDFS-1597) Misplaced assertion in FSEditLog.logSync

2011-01-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986762#action_12986762
 ] 

Todd Lipcon commented on HDFS-1597:
---

The race is the following:

||Thread A||Thread B||
|mkdirs() | - |
| take FSN lock | - |
| ..logEdit() | - |
| drop FSN lock | - |
| - | enterSafeMode() |
| - | saveNamespace() |
| - | ..logSyncAll() |
| - | ..editLog.close() |
| logSync() | - |

In this case, because Thread A's transaction has already been synced in 
logSyncAll, it doesn't actually have any work to sync, i.e. it got batched. 
Accordingly, it's fine that the edit log is closed. But the assertion comes 
before the check that the sync was already batched, so it fires.

This causes occasional failures of TestEditLog on one of our Hudson builds now 
that assertions are enabled.
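The fix is essentially a reordering, which a toy model can illustrate. 
Everything below is a hypothetical, heavily simplified sketch, not the actual 
FSEditLog code (the real class tracks the caller's txid in a ThreadLocal and 
writes to real streams), but the ordering point is the same: check whether the 
sync was already batched before asserting that edit streams exist, and leave 
synctxid alone on the batched path.

```java
import java.util.ArrayList;
import java.util.List;

class EditLogSketch {
    private long synctxid = 0;                         // highest txid synced so far
    private long mytxid = 0;                           // this "thread's" last written txid
    private final List<Object> editStreams = new ArrayList<>();

    void addStream(Object s)        { editStreams.add(s); }
    void closeStreams()             { editStreams.clear(); }  // e.g. saveNamespace()
    void setLastWrittenTxId(long t) { mytxid = t; }
    void markSyncedByOther(long t)  { synctxid = t; }         // another thread's logSyncAll()

    /** Returns true only if this call actually performed a sync. */
    boolean logSync() {
        // 1) First, bail out if another thread already synced our transaction.
        //    Closed streams are fine on this path, and synctxid is left alone.
        if (mytxid <= synctxid) {
            return false;
        }
        // 2) Only now is it valid to insist that edit streams exist.
        assert editStreams.size() > 0 : "no editlog streams";
        synctxid = mytxid;                             // record what we synced
        return true;
    }
}
```

With the original ordering, the logSyncAll()-then-close()-then-logSync() 
sequence from the race table would trip the assertion even though there is 
nothing left to sync.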

> Misplaced assertion in FSEditLog.logSync
> 
>
> Key: HDFS-1597
> URL: https://issues.apache.org/jira/browse/HDFS-1597
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.22.0
>
>
> The top of FSEditLog.logSync has the following assertion:
> {code}
> assert editStreams.size() > 0 : "no editlog streams";
> {code}
> which should actually come after checking to see if the sync was already 
> batched in by another thread.
> Will describe the race in a comment.




[jira] Created: (HDFS-1597) Misplaced assertion in FSEditLog.logSync

2011-01-25 Thread Todd Lipcon (JIRA)
Misplaced assertion in FSEditLog.logSync


 Key: HDFS-1597
 URL: https://issues.apache.org/jira/browse/HDFS-1597
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.22.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Fix For: 0.22.0


The top of FSEditLog.logSync has the following assertion:
{code}
assert editStreams.size() > 0 : "no editlog streams";
{code}
which should actually come after checking to see if the sync was already 
batched in by another thread.

Will describe the race in a comment.




[jira] Commented: (HDFS-1582) Remove auto-generated native build files

2011-01-25 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986743#action_12986743
 ] 

Eli Collins commented on HDFS-1582:
---

Patch looks good.  What testing has been done to check the native part of the 
build, e.g. running libhdfs or fuse-dfs?

> Remove auto-generated native build files
> 
>
> Key: HDFS-1582
> URL: https://issues.apache.org/jira/browse/HDFS-1582
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: contrib/libhdfs
>Reporter: Roman Shaposhnik
>Assignee: Roman Shaposhnik
> Fix For: 0.23.0
>
> Attachments: HADOOP-6436.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The repo currently includes the automake and autoconf generated files for the 
> native build. Per discussion on HADOOP-6421 let's remove them and use the 
> host's automake and autoconf. We should also do this for libhdfs and 
> fuse-dfs. 




[jira] Commented: (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms

2011-01-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986718#action_12986718
 ] 

Todd Lipcon commented on HDFS-1580:
---

Above sounds reasonable with respect to 1073. Is HDFS-1557 almost ready to go? 
(should I take a look at it?)






[jira] Commented: (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms

2011-01-25 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986709#action_12986709
 ] 

Jitendra Nath Pandey commented on HDFS-1580:


   The interface also needs a counterpart of the roll-edits method. 
Currently, the first thing checkpointing does is roll the edit logs, 
i.e. an edits.new is created. As I understand it, in HDFS-1073 the edit files 
will be numbered instead of using edits.new. At least FileWriteAheadLog will 
need to roll to keep edit files from getting too big, even if rolling is not 
required for checkpointing. 
   
   The interface should also provide methods to get all previously rotated edit 
log files (or ledgers) and the current "in-progress" edit log file or ledger. 

   As a suggestion, the interface could have a concept of log handles, where 
each handle uniquely corresponds to a single edit log file or ledger. Thus, we 
could have a getAllLogs method that returns a list of log-handles.
I think ordered handles will fit the HDFS-1073 model (need to confirm). 
A LogHandle can also carry some metadata, for example the first transaction id 
and whether it is current or old. HDFS-1073 proposes to store the first 
transaction-id in the edit-file name itself, which could be used to populate 
the log-handle in the case of FileWriteAheadLog. The input and output streams 
should hang off the LogHandle, so that any log file can be read. The log-handle 
for an older file should not let one create an output stream.  

   A method to purge the edit logs might also be needed, i.e. given a handle, 
remove the corresponding log file (or ledger).
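For illustration, the handle-based interface suggested above might look roughly like the following. Everything here (LogHandle, getAllLogs, roll, purge, and the in-memory class) is an assumed sketch, not the HDFS-1580 API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative handle for one edit-log file or ledger.
class LogHandle {
  final long firstTxId;     // HDFS-1073 proposes storing this in the file name
  boolean inProgress;
  LogHandle(long firstTxId, boolean inProgress) {
    this.firstTxId = firstTxId;
    this.inProgress = inProgress;
  }
}

// In-memory stand-in for a FileWriteAheadLog-style implementation.
class InMemoryWriteAheadLog {
  private final List<LogHandle> logs = new ArrayList<LogHandle>();

  InMemoryWriteAheadLog() {
    logs.add(new LogHandle(1, true));   // open the first in-progress log
  }

  // Roll: finalize the current log and open a new in-progress one.
  void roll(long lastTxIdWritten) {
    logs.get(logs.size() - 1).inProgress = false;
    logs.add(new LogHandle(lastTxIdWritten + 1, true));
  }

  // All logs ordered by first transaction id; only the last may be in progress.
  List<LogHandle> getAllLogs() {
    return new ArrayList<LogHandle>(logs);
  }

  // Purge a rotated log; the in-progress log cannot be removed.
  boolean purge(LogHandle handle) {
    return !handle.inProgress && logs.remove(handle);
  }
}
```

A real FileWriteAheadLog would back each handle with a numbered edit file and expose input/output streams on the handle rather than just metadata.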



> Add interface for generic Write Ahead Logging mechanisms
> 
>
> Key: HDFS-1580
> URL: https://issues.apache.org/jira/browse/HDFS-1580
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ivan Kelly
> Attachments: generic_wal_iface.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986634#action_12986634
 ] 

Todd Lipcon commented on HDFS-1595:
---

Another option not mentioned above, which we use in HBase, is to periodically 
poll the pipeline from the "application code" to find out how many replicas are 
in it (there's an API to do that in trunk). If we see the pipeline drop below 3 
replicas, we roll the output to a new file (hence a new pipeline). For 
something like a commit log, where we really just care about a stream of 
records, the actual file boundaries have no semantic meaning, so rolling is 
"free" and we get a full pipeline.

For MR it's a bit trickier to do in general, but it might be useful for certain 
output formats if we can figure out how to make the APIs non-gross, e.g. if a 
reducer writing part-0 gets a pipeline failure, it could continue writing to 
part-0.1 or something.
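A minimal sketch of that polling check, with an assumed stand-in for the trunk replica-count API (these are not HBase's actual classes):

```java
// Stand-in for the trunk API that exposes the live pipeline width;
// the interface name here is illustrative.
interface ReplicaCountingStream {
  int getNumCurrentReplicas();   // datanodes still in the write pipeline
}

class LogRoller {
  static final int MIN_REPLICAS = 3;
  int rolls = 0;

  // Called periodically: rolling to a new file gives a freshly built,
  // full pipeline, which is cheap when file boundaries carry no meaning.
  boolean maybeRoll(ReplicaCountingStream out) {
    if (out.getNumCurrentReplicas() < MIN_REPLICAS) {
      rolls++;   // a real roller would close this file and open a new one
      return true;
    }
    return false;
  }
}
```

In HBase the check runs on a timer; here maybeRoll just reports whether a roll would be triggered.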

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take out D from the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> in such a way that the faulty network interface works fine when sending out a 
> small amount of data, say 1MB, but it fails when sending out a large amount 
> of data, say 100MB.  Reading is working fine for any data size.
> It is even worse if F is the first datanode in the pipeline.  Consider the 
> following:
> # DFSClient creates a pipeline with three datanodes.  The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F 
> reports the second datanode has error.
> # DFSClient removes the second datanode from the pipeline and continues 
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline.  DFSClient continues writing with one 
> replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under replicated.  The NameNode schedules replication from F 
> to some other datanode D.
> # The replication fails for the same reason.  D reports to the NameNode that 
> the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks

2011-01-25 Thread Ken Goodhope (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Goodhope updated HDFS-863:
--

Attachment: HDFS-863.patch

Finally got my auto-format set up right, and it found a couple more issues my 
eyes missed.  I only formatted the sections I worked on.

> Potential deadlock in TestOverReplicatedBlocks
> --
>
> Key: HDFS-863
> URL: https://issues.apache.org/jira/browse/HDFS-863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Todd Lipcon
>Assignee: Ken Goodhope
> Fix For: 0.23.0
>
> Attachments: cycle.png, HDFS-863.patch, HDFS-863.patch, 
> HDFS-863.patch, HDFS-863.patch, TestNodeCount.png
>
>
> TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on 
> namesystem.heartbeats without synchronizing on namesystem first. Other places 
> in the code synchronize namesystem, then heartbeats. It's probably unlikely 
> to occur in this test case, but it's a simple fix.
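The fix boils down to a lock-ordering rule, sketched here with illustrative stand-ins for the two monitors (not the real FSNamesystem fields):

```java
// Every thread must take the namesystem monitor before the heartbeats
// monitor, matching the rest of the code; the test had taken heartbeats
// alone, inverting the order and allowing a deadlock cycle.
class LockOrdering {
  final Object namesystem = new Object();
  final Object heartbeats = new Object();

  int countHeartbeats(int[] beats) {
    synchronized (namesystem) {      // outer lock first...
      synchronized (heartbeats) {    // ...then the inner lock, never reversed
        return beats.length;
      }
    }
  }
}
```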

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986624#action_12986624
 ] 

Kan Zhang commented on HDFS-1595:
-

Looks like we have 3 options when the pipeline is reduced to a single datanode 
F.

# stop writing and fail fast.

# recruit a new set of datanodes to be added to the pipeline, with F still 
being the first datanode.

# finish writing the block to F and ask the NN to wait for 2 replicas before 
closing the block.

1) has the drawback of failing even when F is healthy, which is undesirable. 
Both 2) and 3) will fail when F is bad, and both will likely succeed when F is 
healthy. Of the two, I'd prefer 2) over 3): how soon a replica can be 
replicated from one datanode to another depends on many factors, so 3) is less 
predictable than 2), where the client actively sets up a new pipeline and 
resumes writing. One long-tail task that takes much longer to finish than the 
others impacts the total running time of the whole job.
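Option 2 in miniature, under assumed names (recruit would really be an RPC to the NameNode asking for replacement datanodes):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Keep F first and append replacement datanodes until the pipeline is
// back at the target width. Names are illustrative, not the DFSClient API.
class PipelineRecovery {
  static List<String> recruit(List<String> pipeline, List<String> fresh, int target) {
    List<String> rebuilt = new ArrayList<String>(pipeline);
    for (String dn : fresh) {
      if (rebuilt.size() >= target) break;
      if (!rebuilt.contains(dn)) rebuilt.add(dn);   // skip nodes already present
    }
    return rebuilt;
  }
}
```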

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take out D from the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> in such a way that the faulty network interface works fine when sending out a 
> small amount of data, say 1MB, but it fails when sending out a large amount 
> of data, say 100MB.  Reading is working fine for any data size.
> It is even worse if F is the first datanode in the pipeline.  Consider the 
> following:
> # DFSClient creates a pipeline with three datanodes.  The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F 
> reports the second datanode has error.
> # DFSClient removes the second datanode from the pipeline and continues 
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline.  DFSClient continues writing with one 
> replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under replicated.  The NameNode schedules replication from F 
> to some other datanode D.
> # The replication fails for the same reason.  D reports to the NameNode that 
> the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks

2011-01-25 Thread Ken Goodhope (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Goodhope updated HDFS-863:
--

Attachment: HDFS-863.patch

Thought I caught all of those, but obviously not.  Found a few more and removed 
them as well.  Thanks.

> Potential deadlock in TestOverReplicatedBlocks
> --
>
> Key: HDFS-863
> URL: https://issues.apache.org/jira/browse/HDFS-863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Todd Lipcon
>Assignee: Ken Goodhope
> Fix For: 0.23.0
>
> Attachments: cycle.png, HDFS-863.patch, HDFS-863.patch, 
> HDFS-863.patch, TestNodeCount.png
>
>
> TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on 
> namesystem.heartbeats without synchronizing on namesystem first. Other places 
> in the code synchronize namesystem, then heartbeats. It's probably unlikely 
> to occur in this test case, but it's a simple fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HDFS-1595:
-

Description: 
Suppose a source datanode S is writing to a destination datanode D in a write 
pipeline.  We have an implicit assumption that _if S catches an exception when 
it is writing to D, then D is faulty and S is fine._  As a result, DFSClient 
will take out D from the pipeline, reconstruct the write pipeline with the 
remaining datanodes and then continue writing.

However, we find a case that the faulty machine F is indeed S but not D.  In 
the case we found, F has a faulty network interface (or a faulty switch port) 
in such a way that the faulty network interface works fine when sending out a 
small amount of data, say 1MB, but it fails when sending out a large amount of 
data, say 100MB.  Reading is working fine for any data size.

It is even worse if F is the first datanode in the pipeline.  Consider the 
following:
# DFSClient creates a pipeline with three datanodes.  The first datanode is F.
# F catches an IOException when writing to the second datanode. Then, F reports 
the second datanode has error.
# DFSClient removes the second datanode from the pipeline and continues writing 
with the remaining datanode(s).
# The pipeline now has two datanodes but (2) and (3) repeat.
# Now, only F remains in the pipeline.  DFSClient continues writing with one 
replica in F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under replicated.  The NameNode schedules replication from F to 
some other datanode D.
# The replication fails for the same reason.  D reports to the NameNode that 
the replica in F is corrupted.
# The NameNode marks the replica in F as corrupted.
# The block is corrupted since no replica is available.

This is a *data loss* scenario.

  was:
Suppose a source datanode S is writing to a destination datanode D in a write 
pipeline.  We have an implicit assumption that _if S catches an exception when 
it is writing to D, then D is faulty and S is fine._  As a result, DFSClient 
will take out D from the pipeline, reconstruct the write pipeline with the 
remaining datanodes and then continue writing .

However, we find a case that the faulty machine F is indeed S but not D.  In 
the case we found, F has a faulty network interface (or a faulty switch port) 
in such a way that the faulty network interface works fine when sending out a 
small amount of data, say 1MB, but it fails when sending out a large amount of 
data, say 100MB.

It is even worst if F is the first datanode in the pipeline.  Consider the 
following:
# DFSClient creates a pipeline with three datanodes.  The first datanode is F.
# F catches an IOException when writing to the second datanode. Then, F reports 
the second datanode has error.
# DFSClient removes the second datanode from the pipeline and continue writing 
with the remaining datanode(s).
# The pipeline now has two datanodes but (2) and (3) repeat.
# Now, only F remains in the pipeline.  DFSClient continues writing with one 
replica in F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under replicated.  The NameNode schedules replication from F to 
some other datanode D.
# The replication fails from the same reason.  D reports to the NameNode that 
the replica in F is corrupted.
# The NameNode marks the replica in F is corrupted.
# The block is corrupted since no replica is available.

This is a *data loss* scenario.


Yes, reading is working fine for any data size.  (updated also the description.)

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take out D from the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> in such a way that the faulty network interface works fine when sending out a 
> small amount of data, say 1MB, but it fails when sending out a large amount 
> of data, say 100MB.  Reading is working fine for any data size.
> It is even worse if F is the first datanode in the pipeline.  

[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986612#action_12986612
 ] 

Hairong Kuang commented on HDFS-1595:
-

> In the case we found, F has a faulty network interface (or a faulty switch 
> port) in such a way that the faulty network interface works fine when sending 
> out a small amount of data, say 1MB, but it fails when sending out a large 
> amount of data, say 100MB.

So this faulty node F has no problem receiving a large amount of data? That 
seems like an extreme case. Normally the client should get an error sending 
data to this faulty node F and thus remove F from the write pipeline.

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take out D from the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> in such a way that the faulty network interface works fine when sending out a 
> small amount of data, say 1MB, but it fails when sending out a large amount 
> of data, say 100MB.
> It is even worse if F is the first datanode in the pipeline.  Consider the 
> following:
> # DFSClient creates a pipeline with three datanodes.  The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F 
> reports the second datanode has error.
> # DFSClient removes the second datanode from the pipeline and continues 
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline.  DFSClient continues writing with one 
> replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under replicated.  The NameNode schedules replication from F 
> to some other datanode D.
> # The replication fails for the same reason.  D reports to the NameNode that 
> the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks

2011-01-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986611#action_12986611
 ] 

Todd Lipcon commented on HDFS-863:
--

Hi Ken. Looks good except for one nit - there are some hard tab characters (e.g. 
TestNodeCount.java:117). The style guide is 2-space indentation. Mind 
reformatting those?

> Potential deadlock in TestOverReplicatedBlocks
> --
>
> Key: HDFS-863
> URL: https://issues.apache.org/jira/browse/HDFS-863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Todd Lipcon
>Assignee: Ken Goodhope
> Fix For: 0.23.0
>
> Attachments: cycle.png, HDFS-863.patch, HDFS-863.patch, 
> TestNodeCount.png
>
>
> TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on 
> namesystem.heartbeats without synchronizing on namesystem first. Other places 
> in the code synchronize namesystem, then heartbeats. It's probably unlikely 
> to occur in this test case, but it's a simple fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks

2011-01-25 Thread Ken Goodhope (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Goodhope updated HDFS-863:
--

Attachment: HDFS-863.patch

> Potential deadlock in TestOverReplicatedBlocks
> --
>
> Key: HDFS-863
> URL: https://issues.apache.org/jira/browse/HDFS-863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Todd Lipcon
>Assignee: Ken Goodhope
> Fix For: 0.23.0
>
> Attachments: cycle.png, HDFS-863.patch, HDFS-863.patch, 
> TestNodeCount.png
>
>
> TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on 
> namesystem.heartbeats without synchronizing on namesystem first. Other places 
> in the code synchronize namesystem, then heartbeats. It's probably unlikely 
> to occur in this test case, but it's a simple fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks

2011-01-25 Thread Ken Goodhope (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Goodhope updated HDFS-863:
--

Attachment: (was: HDFS-863.patch)

> Potential deadlock in TestOverReplicatedBlocks
> --
>
> Key: HDFS-863
> URL: https://issues.apache.org/jira/browse/HDFS-863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Todd Lipcon
>Assignee: Ken Goodhope
> Fix For: 0.23.0
>
> Attachments: cycle.png, HDFS-863.patch, TestNodeCount.png
>
>
> TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on 
> namesystem.heartbeats without synchronizing on namesystem first. Other places 
> in the code synchronize namesystem, then heartbeats. It's probably unlikely 
> to occur in this test case, but it's a simple fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-863) Potential deadlock in TestOverReplicatedBlocks

2011-01-25 Thread Ken Goodhope (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Goodhope updated HDFS-863:
--

Attachment: HDFS-863.patch

Agreed, and done.  Reran the tests with the following results:

[junit] Test org.apache.hadoop.hdfs.server.namenode.TestStorageRestore 
FAILED
[junit] Test org.apache.hadoop.hdfs.TestFileConcurrentReader FAILED
[junit] Test 
org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark FAILED

Skipped 
src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestLargeDirectoryDelete.java
 since that test stalled last time.

> Potential deadlock in TestOverReplicatedBlocks
> --
>
> Key: HDFS-863
> URL: https://issues.apache.org/jira/browse/HDFS-863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Todd Lipcon
>Assignee: Ken Goodhope
> Fix For: 0.23.0
>
> Attachments: cycle.png, HDFS-863.patch, HDFS-863.patch, 
> TestNodeCount.png
>
>
> TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on 
> namesystem.heartbeats without synchronizing on namesystem first. Other places 
> in the code synchronize namesystem, then heartbeats. It's probably unlikely 
> to occur in this test case, but it's a simple fix.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986589#action_12986589
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1595:
--

> I'll be honest: I don't know the new Append code in trunk very well. I 
> thought the client called nn.updatePipeline() whenever a node was removed 
> from the pipeline. That's not the case?

You are actually right about the new append code.  I am sorry, I was mostly 
looking at the 0.20 code.  The idea will work for 0.22, but we need to change 
the protocol for 0.20.  Nice.

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take out D from the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> in such a way that the faulty network interface works fine when sending out a 
> small amount of data, say 1MB, but it fails when sending out a large amount 
> of data, say 100MB.
> It is even worse if F is the first datanode in the pipeline.  Consider the 
> following:
> # DFSClient creates a pipeline with three datanodes.  The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F 
> reports the second datanode has error.
> # DFSClient removes the second datanode from the pipeline and continues 
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline.  DFSClient continues writing with one 
> replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under replicated.  The NameNode schedules replication from F 
> to some other datanode D.
> # The replication fails for the same reason.  D reports to the NameNode that 
> the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1295) Improve namenode restart times by short-circuiting the first block reports from datanodes

2011-01-25 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986564#action_12986564
 ] 

Matt Foley commented on HDFS-1295:
--

Hi Dhruba, I think this is a really important improvement to startup time, so I 
will try to get some contributors here to review it.
Regarding this dangling issue: can you please describe any risks you see from 
corrupt replicas with this shortcut in place?  Thanks.

> Improve namenode restart times by short-circuiting the first block reports 
> from datanodes
> -
>
> Key: HDFS-1295
> URL: https://issues.apache.org/jira/browse/HDFS-1295
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: shortCircuitBlockReport_1.txt
>
>
> The namenode restart is dominated by the performance of processing block 
> reports. On a 2000 node cluster with 90 million blocks,  block report 
> processing takes 30 to 40 minutes. The namenode "diffs" the contents of the 
> incoming block report with the contents of the blocks map, and then applies 
> these diffs to the blocksMap, but in reality there is no need to compute the 
> "diff" because this is the first block report from the datanode.
> This code change improves block report processing time by 300%.
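A hedged sketch of that short-circuit, with illustrative names rather than the actual HDFS-1295 patch:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// The expensive report-vs-blocksMap diff is skipped when this is the
// datanode's first block report, since every reported block is new.
class BlockReportProcessor {
  final Map<String, Set<Long>> blocksPerNode = new HashMap<String, Set<Long>>();

  void processReport(String datanode, Set<Long> reported) {
    Set<Long> known = blocksPerNode.get(datanode);
    if (known == null) {
      // First report from this node: no diff needed, take it wholesale.
      blocksPerNode.put(datanode, new HashSet<Long>(reported));
      return;
    }
    // Subsequent reports: compute the add/remove diff as before.
    Set<Long> toAdd = new HashSet<Long>(reported);
    toAdd.removeAll(known);
    Set<Long> toRemove = new HashSet<Long>(known);
    toRemove.removeAll(reported);
    known.addAll(toAdd);
    known.removeAll(toRemove);
  }
}
```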

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986560#action_12986560
 ] 

Todd Lipcon commented on HDFS-1595:
---

I'll be honest: I don't know the new Append code in trunk very well. I thought 
the client called nn.updatePipeline() whenever a node was removed from the 
pipeline. That's not the case?

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take out D from the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> in such a way that the faulty network interface works fine when sending out a 
> small amount of data, say 1MB, but it fails when sending out a large amount 
> of data, say 100MB.
> It is even worse if F is the first datanode in the pipeline.  Consider the 
> following:
> # DFSClient creates a pipeline with three datanodes.  The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F 
> reports the second datanode has error.
> # DFSClient removes the second datanode from the pipeline and continues 
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline.  DFSClient continues writing with one 
> replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under replicated.  The NameNode schedules replication from F 
> to some other datanode D.
> # The replication fails for the same reason.  D reports to the NameNode that 
> the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1595) DFSClient may incorrectly detect datanode failure

2011-01-25 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986554#action_12986554
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1595:
--

{code}
+int pipelineReplication = biuc.getNumExpectedLocations();
{code}
Hi Todd, I might be wrong but I think it is not as simple as in 
[hdfs-1595-idea.txt|https://issues.apache.org/jira/secure/attachment/12469243/hdfs-1595-idea.txt].
  The value of pipelineReplication above is not equal to the actual number of 
datanodes in the pipeline.  In other words, it won't be updated when datanodes 
are removed from the pipeline.

We need to change some protocol in order to implement this idea.
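A toy illustration of the gap: the expected-locations count is fixed at block allocation time unless the client reports the surviving pipeline (as trunk's updatePipeline does). The class and method here are simplified stand-ins, not the real NameNode classes or RPC:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// NameNode-side view of a block being written. Without a client report,
// the expected locations stay as they were at allocation, even after
// datanodes drop out of the pipeline.
class BlockUnderConstruction {
  private List<String> expectedLocations;

  BlockUnderConstruction(String... pipeline) {
    expectedLocations = new ArrayList<String>(Arrays.asList(pipeline));
  }

  int getNumExpectedLocations() { return expectedLocations.size(); }

  // The protocol change: the client reports the surviving pipeline.
  void updatePipeline(String... survivors) {
    expectedLocations = new ArrayList<String>(Arrays.asList(survivors));
  }
}
```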

> DFSClient may incorrectly detect datanode failure
> -
>
> Key: HDFS-1595
> URL: https://issues.apache.org/jira/browse/HDFS-1595
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.4
>Reporter: Tsz Wo (Nicholas), SZE
>Priority: Critical
> Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write 
> pipeline.  We have an implicit assumption that _if S catches an exception 
> when it is writing to D, then D is faulty and S is fine._  As a result, 
> DFSClient will take out D from the pipeline, reconstruct the write pipeline 
> with the remaining datanodes and then continue writing.
> However, we find a case that the faulty machine F is indeed S but not D.  In 
> the case we found, F has a faulty network interface (or a faulty switch port) 
> in such a way that the faulty network interface works fine when sending out a 
> small amount of data, say 1MB, but it fails when sending out a large amount 
> of data, say 100MB.
> It is even worse if F is the first datanode in the pipeline.  Consider the 
> following:
> # DFSClient creates a pipeline with three datanodes.  The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F 
> reports the second datanode has error.
> # DFSClient removes the second datanode from the pipeline and continues 
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline.  DFSClient continues writing with one 
> replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under replicated.  The NameNode schedules replication from F 
> to some other datanode D.
> # The replication fails for the same reason.  D reports to the NameNode that 
> the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.
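The cascade in steps 2-5 can be simulated in a few lines. This is an illustrative model only (not HDFS code): the faulty first node F fails whenever it streams downstream, the client's error report blames the next node, and each retry therefore evicts a healthy node until only F remains.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the misattributed-failure loop: because the source F
// blames its destination, every retry removes a healthy downstream node,
// and the write finally "succeeds" with the faulty node F alone.
class FaultyPipelineDemo {
    static List<String> writeBlock(List<String> pipeline) {
        List<String> nodes = new ArrayList<>(pipeline);
        // While the faulty F has a downstream node, the send fails and the
        // client removes the node F was writing to (the implicit assumption
        // that the destination, not the source, is at fault).
        while (nodes.size() > 1 && nodes.get(0).equals("F")) {
            nodes.remove(1);   // healthy node evicted; F survives
        }
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println(writeBlock(List.of("F", "d2", "d3")));
    }
}
```

With a healthy first node the pipeline is left untouched; with F first, the loop strips every replica except F, matching step 5 above.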

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1596) Move secondary namenode checkpoint configs from core-default.xml to hdfs-default.xml

2011-01-25 Thread Patrick Angeles (JIRA)
Move secondary namenode checkpoint configs from core-default.xml to 
hdfs-default.xml


 Key: HDFS-1596
 URL: https://issues.apache.org/jira/browse/HDFS-1596
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Reporter: Patrick Angeles


The following configs are in core-default.xml, but are really read by the 
Secondary Namenode. These should be moved to hdfs-default.xml for consistency.


<property>
  <name>fs.checkpoint.dir</name>
  <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary
  name node should store the temporary images to merge.
  If this is a comma-delimited list of directories then the image is
  replicated in all of the directories for redundancy.
  </description>
</property>

<property>
  <name>fs.checkpoint.edits.dir</name>
  <value>${fs.checkpoint.dir}</value>
  <description>Determines where on the local filesystem the DFS secondary
  name node should store the temporary edits to merge.
  If this is a comma-delimited list of directories then the edits are
  replicated in all of the directories for redundancy.
  Default value is same as fs.checkpoint.dir
  </description>
</property>

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.
  </description>
</property>

<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>
  <description>The size of the current edit log (in bytes) that triggers
  a periodic checkpoint even if the fs.checkpoint.period hasn't expired.
  </description>
</property>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1594) When the disk becomes full Namenode is getting shutdown and not able to recover

2011-01-25 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated HDFS-1594:


Status: Open  (was: Patch Available)

> When the disk becomes full Namenode is getting shutdown and not able to 
> recover
> ---
>
> Key: HDFS-1594
> URL: https://issues.apache.org/jira/browse/HDFS-1594
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.21.0, 0.21.1, 0.22.0
> Environment: Linux linux124 2.6.27.19-5-default #1 SMP 2009-02-28 
> 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Devaraj K
> Attachments: hadoop-root-namenode-linux124.log, HDFS-1594.patch
>
>
> When the disk becomes full, the NameNode shuts down.  If we try to start it 
> after making space available, it does not start and throws the exception 
> below.
> {code:xml} 
> 2011-01-24 23:23:33,727 ERROR 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem 
> initialization failed.
> java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:180)
>   at org.apache.hadoop.io.UTF8.readFields(UTF8.java:117)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.readString(FSImageSerialization.java:201)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:60)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1089)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1041)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:487)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:149)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:306)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:284)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:328)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:356)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:577)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:570)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1538)
> 2011-01-24 23:23:33,729 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:180)
>   at org.apache.hadoop.io.UTF8.readFields(UTF8.java:117)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageSerialization.readString(FSImageSerialization.java:201)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:60)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1089)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1041)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:487)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:149)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:306)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:284)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:328)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:356)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:577)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:570)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1538)
> 2011-01-24 23:23:33,730 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down NameNode at linux124/10.18.52.124
> /
> {code} 
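The EOFException in the trace comes from a length-prefixed record that was only partially flushed before the disk filled. A minimal sketch (not HDFS code, but the same read pattern as UTF8.readFields: read a length, then readFully that many bytes):

```java
import java.io.*;

// Sketch: a disk-full crash can leave the last edit-log record half-written.
// The reader trusts the length prefix and calls readFully(), which throws
// EOFException when fewer bytes remain than the prefix promised -- the same
// failure FSEditLogLoader.loadEditRecords() hits above.
class TruncatedRecordDemo {
    // Returns true if reading the truncated record raises EOFException.
    static boolean readTruncated() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(11);                  // header: record is 11 bytes ...
        out.write("hello".getBytes());       // ... but only 5 made it to disk

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        byte[] bytes = new byte[in.readUnsignedShort()];
        try {
            in.readFully(bytes);             // not enough bytes remain
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("EOFException: " + readTruncated());
    }
}
```

This is why the attached patch's direction of skipping or truncating a partial trailing record (rather than aborting startup) is the usual remedy for this class of failure.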

-- 
This messa