[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS

2010-04-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859234#action_12859234
 ] 

Hadoop QA commented on HDFS-1031:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12442391/hdfs-1031_aoriani_4.patch
  against trunk revision 936132.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 2 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321/console

This message is automatically generated.

> Enhance the webUi to list a few of the corrupted files in HDFS
> --
>
> Key: HDFS-1031
> URL: https://issues.apache.org/jira/browse/HDFS-1031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: André Oriani
> Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, 
> hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch
>
>
> The existing webUI displays something like this:
> WARNING : There are about 12 missing blocks. Please check the log or run 
> fsck. 
> It would be nice if we could display the filenames that have missing blocks. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-142) In 0.20, move blocks being written into a blocksBeingWritten directory

2010-04-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-142:
-

Summary: In 0.20, move blocks being written into a blocksBeingWritten 
directory  (was: Datanode should delete files under tmp when upgraded from 0.17)

Renaming the JIRA to reflect the actual scope of this issue in the branch-20 
sync work.

> In 0.20, move blocks being written into a blocksBeingWritten directory
> --
>
> Key: HDFS-142
> URL: https://issues.apache.org/jira/browse/HDFS-142
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Raghu Angadi
>Assignee: dhruba borthakur
>Priority: Blocker
> Attachments: appendQuestions.txt, deleteTmp.patch, deleteTmp2.patch, 
> deleteTmp5_20.txt, deleteTmp5_20.txt, deleteTmp_0.18.patch, handleTmp1.patch, 
> hdfs-142-minidfs-fix-from-409.txt, 
> HDFS-142-multiple-blocks-datanode-exception.patch, HDFS-142_20.patch, 
> testfileappend4-deaddn.txt
>
>
> Before 0.18, when the Datanode restarted, it deleted files under the 
> data-dir/tmp directory, since those files were no longer valid. But in 0.18 
> it incorrectly moves these files to the normal directory, making them valid 
> blocks. One of the following would work:
> - remove the tmp files during upgrade, or
> - if the files under /tmp are in pre-18 format (i.e. no generation stamp), 
> delete them.
> Currently the effect of this bug is that these files end up failing block 
> verification and eventually get deleted, but they cause incorrect 
> over-replication at the namenode before that.
> Also, it looks like our policy regarding the treatment of files under tmp 
> needs to be defined better. Right now there are probably one or two more 
> bugs with it. Dhruba, please file them if you remember.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-142) Datanode should delete files under tmp when upgraded from 0.17

2010-04-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-142:
-

Attachment: testfileappend4-deaddn.txt

I found a bug in the append code where it doesn't work properly with the 
following sequence:
- open a file for write
- write some data
- close it
- the DN with the lowest name dies, but is not yet marked dead on the NN
- a client calls append() to try to recover the lease (not knowing that the 
file isn't currently under construction)

In this case, the client ends up thinking it has opened the file for append, 
and there's a new lease on the NN side, but on the client side it's in an 
error state where close() will throw an IOException (and not release the new 
lease).

Attaching a new test case for TestFileAppend4 covering this situation.
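
A minimal sketch of the failing sequence, in the spirit of the attached test 
(illustrative only, not the attached test verbatim; it assumes a running 
MiniDFSCluster named "cluster" and the usual test imports):

{code}
// Hedged sketch of the failure sequence described above (illustrative only).
void reproduceDeadDnAppend(MiniDFSCluster cluster) throws IOException {
  FileSystem fs = cluster.getFileSystem();
  Path p = new Path("/testfile");

  FSDataOutputStream out = fs.create(p);
  out.write(new byte[1024]);       // write some data
  out.close();                     // file is closed, not under construction

  cluster.stopDataNode(0);         // the DN with the lowest name dies, but
                                   // the NN has not yet marked it dead

  FSDataOutputStream appender = fs.append(p);  // NN grants a new lease
  appender.close();                // client is in an error state: this close()
                                   // throws an IOException and the new lease
                                   // on the NN is never released
}
{code}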

> Datanode should delete files under tmp when upgraded from 0.17
> --
>
> Key: HDFS-142
> URL: https://issues.apache.org/jira/browse/HDFS-142
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Raghu Angadi
>Assignee: dhruba borthakur
>Priority: Blocker
> Attachments: appendQuestions.txt, deleteTmp.patch, deleteTmp2.patch, 
> deleteTmp5_20.txt, deleteTmp5_20.txt, deleteTmp_0.18.patch, handleTmp1.patch, 
> hdfs-142-minidfs-fix-from-409.txt, 
> HDFS-142-multiple-blocks-datanode-exception.patch, HDFS-142_20.patch, 
> testfileappend4-deaddn.txt
>
>
> Before 0.18, when the Datanode restarted, it deleted files under the 
> data-dir/tmp directory, since those files were no longer valid. But in 0.18 
> it incorrectly moves these files to the normal directory, making them valid 
> blocks. One of the following would work:
> - remove the tmp files during upgrade, or
> - if the files under /tmp are in pre-18 format (i.e. no generation stamp), 
> delete them.
> Currently the effect of this bug is that these files end up failing block 
> verification and eventually get deleted, but they cause incorrect 
> over-replication at the namenode before that.
> Also, it looks like our policy regarding the treatment of files under tmp 
> needs to be defined better. Right now there are probably one or two more 
> bugs with it. Dhruba, please file them if you remember.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-875) NameNode incorrectly handles corrupt replicas

2010-04-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859218#action_12859218
 ] 

Todd Lipcon commented on HDFS-875:
--

Is this related/the same as HDFS-900?

> NameNode incorrectly handles corrupt replicas
> 
>
> Key: HDFS-875
> URL: https://issues.apache.org/jira/browse/HDFS-875
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Hairong Kuang
> Fix For: 0.21.0, 0.22.0
>
>
> I reviewed how the NameNode handles corrupt replicas as part of the work on 
> HDFS-145. Compared to releases prior to 0.21, the NameNode now does a good 
> job identifying corrupt replicas, but it seems to me there are two flaws in 
> how it handles them:
> 1. The NameNode does not add corrupt replicas to the block locations, as it 
> did before;
> 2. If the corruption is caused by a generation stamp mismatch or state 
> mismatch, the wrong GS and state do not get put in corruptReplicasMap, 
> which may lead to the deletion of the wrong replica. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode

2010-04-20 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-966:
--

Status: Patch Available  (was: Open)

> NameNode recovers lease even in safemode
> 
>
> Key: HDFS-966
> URL: https://issues.apache.org/jira/browse/HDFS-966
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt
>
>
> The NameNode recovers a lease even when it is in safemode. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode

2010-04-20 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-966:
--

Status: Open  (was: Patch Available)

> NameNode recovers lease even in safemode
> 
>
> Key: HDFS-966
> URL: https://issues.apache.org/jira/browse/HDFS-966
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt
>
>
> The NameNode recovers a lease even when it is in safemode. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-966) NameNode recovers lease even in safemode

2010-04-20 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859192#action_12859192
 ] 

dhruba borthakur commented on HDFS-966:
---

The failed unit test is datanode.TestDiskError, which is not connected to 
this patch, but I will resubmit the patch anyway.

> NameNode recovers lease even in safemode
> 
>
> Key: HDFS-966
> URL: https://issues.apache.org/jira/browse/HDFS-966
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt
>
>
> The NameNode recovers a lease even when it is in safemode. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS

2010-04-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859188#action_12859188
 ] 

André Oriani commented on HDFS-1031:


In case Hudson is still not adding test results to JIRA, the build is 
http://hudson.zones.apache.org/hudson/view/Hdfs/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/321

> Enhance the webUi to list a few of the corrupted files in HDFS
> --
>
> Key: HDFS-1031
> URL: https://issues.apache.org/jira/browse/HDFS-1031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: André Oriani
> Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, 
> hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch
>
>
> The existing webUI displays something like this:
> WARNING : There are about 12 missing blocks. Please check the log or run 
> fsck. 
> It would be nice if we could display the filenames that have missing blocks. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS

2010-04-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

André Oriani updated HDFS-1031:
---

Attachment: hdfs-1031_aoriani_4.patch

Suggestions applied.
File list sorted.
Some changes made due to semantic issues (the listed files are not 
potentially corrupt but in fact corrupt; it is the list that is potentially 
incomplete).

> Enhance the webUi to list a few of the corrupted files in HDFS
> --
>
> Key: HDFS-1031
> URL: https://issues.apache.org/jira/browse/HDFS-1031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: André Oriani
> Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, 
> hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch
>
>
> The existing webUI displays something like this:
> WARNING : There are about 12 missing blocks. Please check the log or run 
> fsck. 
> It would be nice if we could display the filenames that have missing blocks. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS

2010-04-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

André Oriani updated HDFS-1031:
---

Status: Patch Available  (was: Open)

> Enhance the webUi to list a few of the corrupted files in HDFS
> --
>
> Key: HDFS-1031
> URL: https://issues.apache.org/jira/browse/HDFS-1031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: André Oriani
> Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, 
> hdfs-1031_aoriani_3.patch, hdfs-1031_aoriani_4.patch
>
>
> The existing webUI displays something like this:
> WARNING : There are about 12 missing blocks. Please check the log or run 
> fsck. 
> It would be nice if we could display the filenames that have missing blocks. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS

2010-04-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

André Oriani updated HDFS-1031:
---

Status: Open  (was: Patch Available)

> Enhance the webUi to list a few of the corrupted files in HDFS
> --
>
> Key: HDFS-1031
> URL: https://issues.apache.org/jira/browse/HDFS-1031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: André Oriani
> Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, 
> hdfs-1031_aoriani_3.patch
>
>
> The existing webUI displays something like this:
> WARNING : There are about 12 missing blocks. Please check the log or run 
> fsck. 
> It would be nice if we could display the filenames that have missing blocks. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-909:
-

Fix Version/s: 0.20.3

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.20.3, 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-branch-0.21.txt, hdfs-909-unified.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable
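
The double removal at the heart of this race can be pictured with a small 
sketch (illustrative only, not the actual EditLogFileOutputStream code; the 
committed fix serializes these calls):

{code}
// Illustrative model of the race: two unsynchronized callers both strip the
// trailing OP_INVALID marker, so a real edit byte is lost as well.
class DoubleBufferSketch {
  private int length = 10;        // last byte is the OP_INVALID marker

  void setReadyToFlush() {        // unsynchronized: the race
    length--;                     // intended to drop only the EOF marker
  }
  // Thread A: FSEditLog.logSync() -> setReadyToFlush()
  // Thread B: FSEditLog.closeStream() during a roll -> setReadyToFlush()
  // If both run, length goes 10 -> 8, losing one valid byte of edits.
  // The fix is to make the call mutually exclusive, e.g.
  //   synchronized void setReadyToFlush() { ... }
}
{code}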

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-909:
-

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

I just committed this. Thank you, Todd.

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-branch-0.21.txt, hdfs-909-unified.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-909:
-

Attachment: hdfs-909-branch-0.21.txt

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-branch-0.21.txt, hdfs-909-unified.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer

2010-04-20 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859171#action_12859171
 ] 

sam rash commented on HDFS-1102:


Actually, I was discussing this with another friend and they pointed out that 
we don't even need to change how hftp works. Even with chunked encoding, we 
should be able to verify on the client, since the server will send:

size1\n
<data1>
size2\n
<data2>
0

If we see fewer than size_N bytes for a chunk, or do not see the terminating 
0, we missed data. The underlying HTTP client *should* handle this. If not, 
we can switch to:

http://hc.apache.org/

which apparently is better than using java.net.URL's underlying connection 
client.
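
A hedged sketch of that client-side check (illustrative only, not the actual 
HftpFileSystem code; it assumes the chunk-size headers arrive as plain hex 
lines):

{code}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

class ChunkedCompletenessCheck {
  // Read every chunk and require the terminating zero-size chunk;
  // a truncated stream fails one of the two checks below.
  static void verify(DataInputStream in) throws IOException {
    while (true) {
      String header = in.readLine();   // "size_N" in hex (deprecated API,
      if (header == null) {            // but fine for a sketch)
        throw new EOFException("stream ended before the terminating 0");
      }
      int size = Integer.parseInt(header.trim(), 16);
      if (size == 0) {
        return;                        // saw the 0: transfer is complete
      }
      byte[] chunk = new byte[size];
      in.readFully(chunk);             // throws EOFException if chunk is short
      in.readLine();                   // consume the trailing CRLF
    }
  }
}
{code}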



> HftpFileSystem : errors during transfer result in truncated transfer
> 
>
> Key: HDFS-1102
> URL: https://issues.apache.org/jira/browse/HDFS-1102
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.1
>Reporter: sam rash
>
> If an error occurs transferring the data over HTTP, the HftpInputStream does 
> not know it received fewer bytes than the file contains.  We can at least 
> detect this and throw an EOFException when this occurs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-966) NameNode recovers lease even in safemode

2010-04-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859168#action_12859168
 ] 

Hadoop QA commented on HDFS-966:


-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12442369/leaseRecoverSafeMode2.txt
  against trunk revision 936024.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/157/console

This message is automatically generated.

> NameNode recovers lease even in safemode
> 
>
> Key: HDFS-966
> URL: https://issues.apache.org/jira/browse/HDFS-966
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt
>
>
> The NameNode recovers a lease even when it is in safemode. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-780) Revive TestFuseDFS

2010-04-20 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-780:
-

Attachment: hdfs-780-1.patch

Attached a patch that fixes up all the build files to get the test running 
again. The test itself still fails, due to HDFS-940 and some issues with its 
Java code. Run with:
{code}
$ ant -Dcompile.c++=true -Dlibhdfs=true compile
$ ant -Dlibhdfs=1 -Dfusedfs=1 test-contrib 
{code}

> Revive TestFuseDFS
> --
>
> Key: HDFS-780
> URL: https://issues.apache.org/jira/browse/HDFS-780
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: contrib/fuse-dfs
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-780-1.patch
>
>
> Looks like TestFuseDFS has bit rot. Let's revive it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859156#action_12859156
 ] 

Hadoop QA commented on HDFS-1101:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12442342/H1101-1.patch
  against trunk revision 936024.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/320/console

This message is automatically generated.

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1052) HDFS scalability with multiple namenodes

2010-04-20 Thread Sanjay Radia (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanjay Radia updated HDFS-1052:
---

Attachment: Mulitple Namespaces5.pdf

Minor updates to the doc (plus name change).

> HDFS scalability with multiple namenodes
> 
>
> Key: HDFS-1052
> URL: https://issues.apache.org/jira/browse/HDFS-1052
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.22.0
>Reporter: Suresh Srinivas
>Assignee: Suresh Srinivas
> Attachments: Block pool proposal.pdf, Mulitple Namespaces5.pdf
>
>
> HDFS currently uses a single namenode, which limits the scalability of the 
> cluster. This JIRA proposes an architecture to scale the nameservice 
> horizontally using multiple namenodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-909:
-

Attachment: hdfs-909-unified.txt
hdfs-909-branch-0.20.txt

Here's a unified patch for trunk (the one you committed to trunk plus the 
test case fixes).
Also a branch-20 patch that addresses the two Eclipse warnings you found.

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-unified.txt, hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859133#action_12859133
 ] 

Konstantin Shvachko commented on HDFS-909:
--

- The issue is not closed, so it would be better to have a unified patch 
rather than doing 2 commits. I don't mind recommitting.
- The test for 0.20 passes fine now. Found 2 (Eclipse) warnings in 
TestEditLogRace:
-- Method {{getFormattedFSImage()}} is not used anywhere.
-- The static method {{setBufferCapacity()}} should be called in a static 
manner, like {{FSEditLog.setBufferCapacity()}}.
- I understand Tom's plan for 0.21. It does not hurt to commit, though.

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode

2010-04-20 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-966:
--

Attachment: leaseRecoverSafeMode2.txt

Merged patch with latest trunk.

> NameNode recovers lease even in safemode
> 
>
> Key: HDFS-966
> URL: https://issues.apache.org/jira/browse/HDFS-966
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt
>
>
> The NameNode recovers a lease even when it is in safemode. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-966) NameNode recovers lease even in safemode

2010-04-20 Thread dhruba borthakur (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhruba borthakur updated HDFS-966:
--

Status: Patch Available  (was: Open)

> NameNode recovers lease even in safemode
> 
>
> Key: HDFS-966
> URL: https://issues.apache.org/jira/browse/HDFS-966
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Attachments: leaseRecoverSafeMode.txt, leaseRecoverSafeMode2.txt
>
>
> The NameNode recovers a lease even when it is in safemode. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859125#action_12859125
 ] 

Todd Lipcon commented on HDFS-909:
--

hdfs-909-ammendation.txt goes with this comment above:

https://issues.apache.org/jira/browse/HDFS-909?focusedCommentId=12859069&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12859069

(The test as committed in trunk is flaky as well; this is a patch against 
trunk that fixes it. The bug is just in the test, though, not in the code 
itself.)

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859123#action_12859123
 ] 

Konstantin Shvachko commented on HDFS-909:
--

What is hdfs-909-ammendation.txt for?

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859086#action_12859086
 ] 

Todd Lipcon commented on HDFS-909:
--

bq. Not sure how much 0.21 is abandoned. I hear people use it with HBase. 
Here is the patch.

The plan for HBase 0.20.5 is to work against Tom's new 0.21 release or a 0.20 
with HDFS-200 applied, not the current 0.21 branch. I checked with Cosmin and 
he is OK moving to what's now trunk.

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log

2010-04-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-909:
-

Attachment: hdfs-909-branch-0.20.txt

Updated the branch-20 patch with the same changes (plus cleanup of the 
changes I accidentally left in before).

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, hdfs-909-unittest.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log, resulting 
> in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream's setReadyToFlush(), flush(), or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, it can race on EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally FSEditsLog.OP_INVALID, the 
> "end-of-file marker") and then remove it twice (once from each thread) in 
> flushAndSync()! Hence a valid byte will be missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768. 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859071#action_12859071
 ] 

Konstantin Shvachko commented on HDFS-909:
--

FSEditLog.java imports org.apache.tools.ant.taskdefs.WaitFor in your patch for 
0.20.
As you can see, I've already committed the other two branches. So it would be 
good to finish this sooner rather than later. Thanks.

> Race condition between rollEditLog or rollFSImage and FSEditLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.21.txt, hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, 
> resulting in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally the FSEditLog.OP_INVALID that serves 
> as the "end-of-file marker") and then remove it twice (once from each thread) 
> in flushAndSync()! Hence a valid byte goes missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

2010-04-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-909:
-

Attachment: hdfs-909-ammendation.txt

It turns out the test on trunk was flaky as well. The issue was that we were 
calling saveNamespace directly on the FSImage while also performing edits from 
the Transactions threads. This is exactly the behavior we're trying to avoid by 
forcing the NN into safemode first. Also, we were calling verifyEdits() on an 
edit log that was being simultaneously written to, which is likely to fail if 
it reads a partial edit.

This patch against trunk does the following:
- Bumps up the number of rolls and saves to 30 instead of 10, since 10 
obviously wasn't enough to make it fail reliably.
- Replaces use of the FSN log with the test's own log.
- Changes the transaction threads to operate via FSN rather than logging 
directly to the edit log.
- Makes any exception thrown by the edits cause the test to fail properly 
(see the sketch below).

To verify this fix, I temporarily bumped the constant for the number of rolls 
up to 200 and checked that it passed.

The test sometimes failed for me without HADOOP-6717, a trivial patch which 
reduces the amount of log output from the new security code.

I'll separately amend the branch-20 patch with the same changes.
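
For illustration, a generic sketch of the failure-propagation pattern in the 
last bullet above (the names are made up; this is not the committed test code):

{code}
import java.util.concurrent.atomic.AtomicReference;

// Worker that records the first failure instead of dying silently, so the
// test thread can assert on it after join().
class TransactionWorker implements Runnable {
  private final AtomicReference<Throwable> firstError;

  TransactionWorker(AtomicReference<Throwable> firstError) {
    this.firstError = firstError;
  }

  @Override
  public void run() {
    try {
      // ... perform edits through the FSNamesystem here ...
    } catch (Throwable t) {
      firstError.compareAndSet(null, t); // remember only the first failure
    }
  }
}
// In the test, after joining the workers:
//   assertNull("a transaction thread failed", firstError.get());
{code}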

> Race condition between rollEditLog or rollFSImage and FSEditLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-ammendation.txt, hdfs-909-branch-0.20.txt, 
> hdfs-909-branch-0.21.txt, hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, 
> resulting in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally the FSEditLog.OP_INVALID that serves 
> as the "end-of-file marker") and then remove it twice (once from each thread) 
> in flushAndSync()! Hence a valid byte goes missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer

2010-04-20 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859056#action_12859056
 ] 

sam rash commented on HDFS-1102:


Well, it's not without modifying hftp, but a client can verify the content 
length if it's there, and proceed as it does now if it's not present.

We could also have a config option, say "enforce content length", which would 
cause a missing content length to throw an IOException.

In this way, if both the client and server are on the latest hftp, this works; 
otherwise it will work as before.

Offhand, I'm not sure how to do this without either changing hftp or wrapping 
it in some other protocol that does length checking.
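
A minimal sketch of that compatibility rule, assuming a hypothetical 
strict-mode flag (none of these names come from a patch):

{code}
import java.io.IOException;

// Hedged sketch: decide what to do when the Content-Length header is absent.
public class ContentLengthPolicy {
  /** @return the expected length, or -1 to fall back to pre-patch behavior. */
  public static long expectedLength(String contentLengthHeader,
                                    boolean enforceContentLength)
      throws IOException {
    if (contentLengthHeader == null) {
      if (enforceContentLength) {
        // New client in strict mode talking to an old server: fail loudly.
        throw new IOException("server did not send a Content-Length header");
      }
      return -1; // old behavior: no length check is possible
    }
    return Long.parseLong(contentLengthHeader);
  }
}
{code}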

> HftpFileSystem : errors during transfer result in truncated transfer
> 
>
> Key: HDFS-1102
> URL: https://issues.apache.org/jira/browse/HDFS-1102
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.1
>Reporter: sam rash
>
> If an error occurs transferring the data over HTTP, the HftpInputStream does 
> not know it received fewer bytes than the file contains.  We can at least 
> detect this and throw an EOFException when this occurs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859049#action_12859049
 ] 

Konstantin Shvachko commented on HDFS-1101:
---

I agree, this looks better. Thanks.
+1 for the patch.

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer

2010-04-20 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859030#action_12859030
 ] 

dhruba borthakur commented on HDFS-1102:


> It would be nice if we can fix this without changing the hftp protocol.

any idea on how this can be done?

> HftpFileSystem : errors during transfer result in truncated transfer
> 
>
> Key: HDFS-1102
> URL: https://issues.apache.org/jira/browse/HDFS-1102
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.1
>Reporter: sam rash
>
> If an error occurs transferring the data over HTTP, the HftpInputStream does 
> not know it received fewer bytes than the file contains.  We can at least 
> detect this and throw an EOFException when this occurs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer

2010-04-20 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi resolved HDFS-1102.


Resolution: Duplicate

Duplicate of HDFS-1085.   It would be nice if we can fix this without changing 
the hftp protocol.

> HftpFileSystem : errors during transfer result in truncated transfer
> 
>
> Key: HDFS-1102
> URL: https://issues.apache.org/jira/browse/HDFS-1102
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.1
>Reporter: sam rash
>
> If an error occurs transferring the data over HTTP, the HftpInputStream does 
> not know it received fewer bytes than the file contains.  We can at least 
> detect this and throw an EOFException when this occurs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1087) Use StringBuilder instead of Formatter for audit logs

2010-04-20 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated HDFS-1087:


   Status: Resolved  (was: Patch Available)
 Hadoop Flags: [Reviewed]
 Assignee: Chris Douglas
Fix Version/s: 0.22.0
   Resolution: Fixed

I committed this.

Thanks for the review, Nicholas.

> Use StringBuilder instead of Formatter for audit logs
> -
>
> Key: HDFS-1087
> URL: https://issues.apache.org/jira/browse/HDFS-1087
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Reporter: Chris Douglas
>Assignee: Chris Douglas
>Priority: Minor
> Fix For: 0.22.0
>
> Attachments: H1087-0.patch, H1087-1.patch, H1087-2.patch
>
>
> The audit logs do not use any {{format}} functionality that cannot be 
> replaced by a simple, more efficient set of appends.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated HDFS-1101:


Status: Open  (was: Patch Available)

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated HDFS-1101:


Status: Patch Available  (was: Open)

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated HDFS-1101:


Attachment: H1101-1.patch

Forgot to include Konstantin's javac warning fixes.

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: H1101-0.patch, H1101-1.patch, TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated HDFS-1101:


Attachment: H1101-0.patch

I'm sorry, I missed this in review.

Though the current patch works for this case with only 1 datanode, pulling it 
from the DataNode is closer to the intent of the test and doesn't modify 
MiniDFSCluster.

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: H1101-0.patch, TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

2010-04-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859001#action_12859001
 ] 

Todd Lipcon commented on HDFS-909:
--

Hi Konstantin,

Thanks for the review. It does seem like the test for branch-20 occasionally 
fails - I had it passing here, but it's flaky and
doesn't pass every time. Let me dig into this and upload a new fixed patch.

bq. What is org.apache.tools.ant.taskdefs.WaitFor used for?

No idea where this came from. I've been trying out Eclipse recently instead of 
my usual vim, and haven't gotten used to cleaning up after its "smarts" :) 
I'll double-check the next patch for such cruft as well.

> Race condition between rollEditLog or rollFSImage and FSEditLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, 
> resulting in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally the FSEditLog.OP_INVALID that serves 
> as the "end-of-file marker") and then remove it twice (once from each thread) 
> in flushAndSync()! Hence a valid byte goes missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858848#action_12858848
 ] 

Hadoop QA commented on HDFS-1101:
-

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12442252/TestDiskErrorLocal.patch
  against trunk revision 935778.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/319/console

This message is automatically generated.

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-481) Bug Fixes + HdfsProxy to use proxy user to impersonate the real user

2010-04-20 Thread Srikanth Sundarrajan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sundarrajan updated HDFS-481:
--

Attachment: HDFS-481-bp-y20s.patch

Incremental back port to fix broken unit tests in y20.1xx & y20.101. Tests are 
broken due to:

* Missing superuser setup when the Mini DFS Cluster starts
* Missing src/test/resources folder
* The UserGroupInformation class depending on the system krb5.conf (bypassed 
via a krb5.conf in ${hadoop.core}/src/test, a contrib/hdfsproxy/build.xml 
change)

This patch needs to be applied incrementally over the HDFS-481-NEW.patch.

> Bug Fixes + HdfsProxy to use proxy user to impersonate the real user
> 
>
> Key: HDFS-481
> URL: https://issues.apache.org/jira/browse/HDFS-481
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: contrib/hdfsproxy
>Affects Versions: 0.21.0
>Reporter: zhiyong zhang
>Assignee: Srikanth Sundarrajan
> Fix For: 0.22.0
>
> Attachments: HDFS-481-bp-y20.patch, HDFS-481-bp-y20.patch, 
> HDFS-481-bp-y20s.patch, HDFS-481-bp-y20s.patch, HDFS-481-bp-y20s.patch, 
> HDFS-481-NEW.patch, HDFS-481.out, HDFS-481.patch, HDFS-481.patch, 
> HDFS-481.patch, HDFS-481.patch, HDFS-481.patch, HDFS-481.patch, 
> HDFS-481.patch, HDFS-481.patch
>
>
> Bugs:
> 1. hadoop-version is not recognized if the ant command is run from 
> src/contrib/ or from src/contrib/hdfsproxy.
> If the ant command is run from $HADOOP_HDFS_HOME, hadoop-version will be 
> passed to contrib's build through subant. But if it is run from src/contrib 
> or src/contrib/hdfsproxy, the hadoop-version will not be recognized.
> 2. LdapIpDirFilter.java is not thread-safe. userName, Group & Paths are per 
> request and can't be class members.
> 3. Addressed the following StackOverflowError:
> ERROR [org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/].[proxyForward]] 
> Servlet.service() for servlet proxyForward threw exception
> java.lang.StackOverflowError
> at 
> org.apache.catalina.core.ApplicationHttpRequest.getAttribute(ApplicationHttpRequest.java:229)
> This happens because when the target war (/target.war) does not exist, the 
> forwarding war forwards to its parent context path /, which is defined by the 
> forwarding war itself. This causes an infinite loop. Added "HDFS Proxy 
> Forward".equals(dstContext.getServletContextName()) to the if logic to break 
> the loop.
> 4. Kerberos credentials of the remote user aren't available. HdfsProxy needs 
> to act on behalf of the real user to service the requests.
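
For illustration, a hedged sketch of the loop guard from item 3; the 
surrounding servlet code is approximated and is not the actual HdfsProxy 
source:

{code}
import java.io.IOException;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ProxyForwardServlet extends HttpServlet {
  @Override
  public void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    ServletContext dstContext = getServletContext().getContext("/target.war");
    // If /target.war is not deployed, the container can hand back the
    // forwarding context itself; forwarding into it again would recurse
    // until a StackOverflowError. The name check breaks that cycle.
    if (dstContext == null
        || "HDFS Proxy Forward".equals(dstContext.getServletContextName())) {
      resp.sendError(HttpServletResponse.SC_NOT_FOUND,
          "target war is not deployed");
      return;
    }
    RequestDispatcher dispatcher =
        dstContext.getRequestDispatcher(req.getServletPath());
    dispatcher.forward(req, resp);
  }
}
{code}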

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS

2010-04-20 Thread Rodrigo Schmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858793#action_12858793
 ] 

Rodrigo Schmidt commented on HDFS-1031:
---

Side effect in my opinion!

> Enhance the webUi to list a few of the corrupted files in HDFS
> --
>
> Key: HDFS-1031
> URL: https://issues.apache.org/jira/browse/HDFS-1031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: André Oriani
> Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, 
> hdfs-1031_aoriani_3.patch
>
>
> The existing webUI displays something like this:
> WARNING : There are about 12 missing blocks. Please check the log or run 
> fsck. 
> It would be nice if we can display the filenames that have missing blocks. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer

2010-04-20 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858750#action_12858750
 ] 

sam rash commented on HDFS-1102:


Proposed solution:

StreamFile: the Datanode will set the content length in the header.
HftpInputStream: the read() method will verify, when the underlying input 
stream from the http connection returns -1, that it has received all the 
bytes, and otherwise throw an EOFException.
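
A minimal sketch of that check, with illustrative names (not taken from an 
actual patch):

{code}
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wraps the HTTP stream and compares bytes received against the
// Content-Length the server advertised.
public class LengthVerifyingInputStream extends FilterInputStream {
  private final long expected; // from the Content-Length header
  private long received = 0;

  public LengthVerifyingInputStream(InputStream in, long expected) {
    super(in);
    this.expected = expected;
  }

  @Override
  public int read() throws IOException {
    int c = in.read();
    if (c >= 0) {
      received++;
    } else {
      checkComplete();
    }
    return c;
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    int n = in.read(b, off, len);
    if (n > 0) {
      received += n;
    } else if (n == -1) {
      checkComplete();
    }
    return n;
  }

  private void checkComplete() throws EOFException {
    if (received < expected) {
      // The server died mid-transfer: surface it instead of a short read.
      throw new EOFException("expected " + expected + " bytes, received "
          + received);
    }
  }
}
{code}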


> HftpFileSystem : errors during transfer result in truncated transfer
> 
>
> Key: HDFS-1102
> URL: https://issues.apache.org/jira/browse/HDFS-1102
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.1
>Reporter: sam rash
>
> If an error occurs transferring the data over HTTP, the HftpInputStream does 
> not know it received fewer bytes than the file contains.  We can at least 
> detect this and throw an EOFException when this occurs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer

2010-04-20 Thread sam rash (JIRA)
HftpFileSystem : errors during transfer result in truncated transfer


 Key: HDFS-1102
 URL: https://issues.apache.org/jira/browse/HDFS-1102
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node, hdfs client
Affects Versions: 0.20.1
Reporter: sam rash


If an error occurs transferring the data over HTTP, the HftpInputStream does 
not know it received fewer bytes than the file contains.  We can at least 
detect this and throw an EOFException when this occurs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1102) HftpFileSystem : errors during transfer result in truncated transfer

2010-04-20 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858748#action_12858748
 ] 

sam rash commented on HDFS-1102:


The log entry in the datanode:

2010-04-19 16:42:59,072 ERROR org.mortbay.log: /streamFile
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at 
org.mortbay.io.nio.SelectChannelEndPoint.updateKey(SelectChannelEndPoint.java:324)
at 
org.mortbay.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:278)
at 
org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:542)
at 
org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
at 
org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
at 
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
at 
org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
at 
org.apache.hadoop.hdfs.server.namenode.StreamFile.doGet(StreamFile.java:73)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:669)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)


> HftpFileSystem : errors during transfer result in truncated transfer
> 
>
> Key: HDFS-1102
> URL: https://issues.apache.org/jira/browse/HDFS-1102
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client
>Affects Versions: 0.20.1
>Reporter: sam rash
>
> If an error occurs transferring the data over HTTP, the HftpInputStream does 
> not know it received fewer bytes than the file contains.  We can at least 
> detect this and throw an EOFException when this occurs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-940) libhdfs uses UnixUserGroupInformation

2010-04-20 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-940:
-

  Summary: libhdfs uses UnixUserGroupInformation  (was: libhdfs test 
uses UnixUserGroupInformation)
Fix Version/s: 0.21.0
  Description: libhdfs uses the non-existent class UnixUserGroupInformation.  
(was: The libhdfs test fails with the following and needs to be updated since 
UnixUserGroupInformation was removed.

 [exec] failed to construct hadoop user unix group info object
 [exec] Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.hadoop.security.UserGroupInformation: method <init>()V not found
 [exec] at 
org.apache.hadoop.security.UnixUserGroupInformation.<init>(UnixUserGroupInformation.java:69)
 [exec] Call to org/apache/hadoop/security/UnixUserGroupInformation failed!
 [exec] Oops! Failed to connect to hdfs as user nobody!
)

> libhdfs uses UnixUserGroupInformation
> -
>
> Key: HDFS-940
> URL: https://issues.apache.org/jira/browse/HDFS-940
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: contrib/libhdfs
>Affects Versions: 0.22.0
>Reporter: Eli Collins
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
>
> libhdfs uses the non-existent class UnixUserGroupInformation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-1101:
--

Assignee: Konstantin Shvachko

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858742#action_12858742
 ] 

Konstantin Shvachko commented on HDFS-1101:
---

Looks like this was introduced by HDFS-997.

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-1101:
--

Status: Patch Available  (was: Open)

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1101) TestDiskError.testLocalDirs() fails

2010-04-20 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-1101:
--

Attachment: TestDiskErrorLocal.patch

MiniDFSCluster overrides the data-node storage directories while the original 
config is still pointing to the default values.
Therefore the directory cannot be found.
The patch fixes the problem and two javac warnings in MiniDFSCluster.
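
For illustration, a hedged sketch of the pitfall; the accessors are believed 
to exist on these classes, but treat the exact calls as assumptions:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

// MiniDFSCluster rewrites dfs.data.dir for each node it starts, so a test
// must read the per-datanode configuration, not the one it passed in.
public class StaleConfExample {
  public static void main(String[] args) throws Exception {
    Configuration original = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster(original, 1, true, null);
    try {
      DataNode dn = cluster.getDataNodes().get(0);
      // Directories the datanode actually uses:
      String liveDirs = dn.getConf().get("dfs.data.dir");
      // May still hold the defaults, hence the FileNotFoundException:
      String staleDirs = original.get("dfs.data.dir");
      System.out.println("live: " + liveDirs + " vs stale: " + staleDirs);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}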

> TestDiskError.testLocalDirs() fails
> ---
>
> Key: HDFS-1101
> URL: https://issues.apache.org/jira/browse/HDFS-1101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.22.0
>Reporter: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: TestDiskErrorLocal.patch
>
>
> {{TestDiskError.testLocalDirs()}} fails with {{FileNotFoundException}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1014) Error in reading delegation tokens from edit logs.

2010-04-20 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan updated HDFS-1014:
--

Hadoop Flags: [Reviewed]

> Error in reading delegation tokens from edit logs.
> --
>
> Key: HDFS-1014
> URL: https://issues.apache.org/jira/browse/HDFS-1014
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jitendra Nath Pandey
>Assignee: Jitendra Nath Pandey
> Attachments: HDFS-1014-y20.1.patch, HDFS-1014.2.patch, 
> HDFS-1014.3.patch
>
>
>  When delegation tokens are read from the edit logs... the same object is 
> used to read the identifier and is stored in the token cache. This is wrong 
> because the same object is getting updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1014) Error in reading delegation tokens from edit logs.

2010-04-20 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan updated HDFS-1014:
--

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.22.0
   Resolution: Fixed

I've just committed this.  Resolving as fixed.  Thanks for the contribution, 
Jitendra.

> Error in reading delegation tokens from edit logs.
> --
>
> Key: HDFS-1014
> URL: https://issues.apache.org/jira/browse/HDFS-1014
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jitendra Nath Pandey
>Assignee: Jitendra Nath Pandey
> Fix For: 0.22.0
>
> Attachments: HDFS-1014-y20.1.patch, HDFS-1014.2.patch, 
> HDFS-1014.3.patch
>
>
>  When delegation tokens are read from the edit logs... the same object is 
> used to read the identifier and is stored in the token cache. This is wrong 
> because the same object is getting updated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858733#action_12858733
 ] 

Konstantin Shvachko commented on HDFS-909:
--

Some more:
What is {{org.apache.tools.ant.taskdefs.WaitFor}} used for?
And there is a blank-line change at the end of FSEditLog.


> Race condition between rollEditLog or rollFSImage and FSEditLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, 
> resulting in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally the FSEditLog.OP_INVALID that serves 
> as the "end-of-file marker") and then remove it twice (once from each thread) 
> in flushAndSync()! Hence a valid byte goes missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858731#action_12858731
 ] 

Konstantin Shvachko commented on HDFS-909:
--

Todd, I tried to run TestEditLogRace with your 0.20 patch. It runs forever and 
finally times out.
It feels like it does a lot of transactions. Could you please verify?
In trunk the same test runs in 42 secs.
Also, you have some debug printouts in the patch, like "= CLOSE DONE", and 
you use FSNamesystem.LOG for logging in the test.
The latter is confusing, as you'd then expect a message from FSNamesystem while 
it actually comes from the test.

> Race condition between rollEditLog or rollFSImage and FSEditLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, 
> resulting in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally the FSEditLog.OP_INVALID that serves 
> as the "end-of-file marker") and then remove it twice (once from each thread) 
> in flushAndSync()! Hence a valid byte goes missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-1096) allow dfsadmin/mradmin refresh of superuser proxy group mappings

2010-04-20 Thread Boris Shkolnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Shkolnik updated HDFS-1096:
-

Attachment: HDFS-1096-BP20-7.patch

Combined the fix into one patch for the previous version.

> allow dfsadmin/mradmin refresh of superuser proxy group mappings
> 
>
> Key: HDFS-1096
> URL: https://issues.apache.org/jira/browse/HDFS-1096
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Boris Shkolnik
>Assignee: Boris Shkolnik
> Attachments: HDFS-1096-BP20-4.patch, HDFS-1096-BP20-6-fix.patch, 
> HDFS-1096-BP20-6.patch, HDFS-1096-BP20-7.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditLog.write operations corrupts edits log

2010-04-20 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-909:
-

Attachment: hdfs-909-branch-0.21.txt

Not sure how much 0.21 is abandoned. I hear people use it with HBase. Here is 
the patch.

> Race condition between rollEditLog or rollFSImage and FSEditLog.write 
> operations corrupts edits log
> -
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
>Reporter: Cosmin Lehene
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-branch-0.20.txt, hdfs-909-branch-0.21.txt, 
> hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, 
> hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, 
> resulting in the OP_INVALID end-of-file marker first being overwritten by the 
> concurrent threads (in setReadyToFlush) and then removed twice from the 
> buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> 
> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> 
> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> 
> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> 
> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the 
> FSEditLog.logSync() method level. However, at a lower level, access to 
> EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT 
> synchronized. These can be called from concurrent threads, as in the example 
> above.
> So if a rollEditLog or rollFSImage happens at the same time as a write 
> operation, the two can race in EditLogFileOutputStream.setReadyToFlush, which 
> will overwrite the last byte (normally the FSEditLog.OP_INVALID that serves 
> as the "end-of-file marker") and then remove it twice (once from each thread) 
> in flushAndSync()! Hence a valid byte goes missing from the edits log, which 
> leads to a silent SecondaryNameNode failure and a full HDFS failure upon 
> cluster restart.
> We got to this point after investigating a corrupted edits file that made 
> HDFS unable to start with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but 
> writables.length is 768.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450)
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1031) Enhance the webUi to list a few of the corrupted files in HDFS

2010-04-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HDFS-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858718#action_12858718
 ] 

André Oriani commented on HDFS-1031:


One doubt regarding sorting the output. At my work, l10n is a daily concern, so 
I pay attention to it when coding.
We're going to compare Paths. The Path.compareTo method delegates its 
implementation to java.net.URI.compareTo, which uses String.compareTo for path 
elements. That method of the String class does not use collation.

So for the program:

{code}
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;

public class PathComparisons {
    public static void main(String[] args) {
        try {
            URI[] paths = { new URI("file:///b/a"), new URI("file:///a/c"),
                            new URI("file:///a/b"), new URI("file:///b/z"),
                            new URI("file:///a/á"), new URI("file:///b/ç"),
                            new URI("file:///a/b/c/d/e") };
            Arrays.sort(paths);
            for (URI path : paths) {
                System.out.println(path.toString());
            }
        } catch (URISyntaxException e) {
            // unreachable for these literals
        }
    }
}
{code}

The output is:
{noformat}
file:///a/b
file:///a/b/c/d/e
file:///a/c
file:///a/á
file:///b/a
file:///b/z
file:///b/ç
{noformat}

I mean, the character order is {a,b,z,á,ç} instead of the expected 
{a,á,b,c,ç,z}.

Should I handle this or treat it as a minor side-effect?
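
For reference, a locale-aware sort would go through java.text.Collator, 
roughly like the sketch below (the locale is chosen arbitrarily; this is not 
proposed patch code):

{code}
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatedPathSort {
  public static void main(String[] args) {
    String[] paths = { "file:///b/a", "file:///a/c", "file:///a/b",
                       "file:///b/z", "file:///a/á", "file:///b/ç",
                       "file:///a/b/c/d/e" };
    // Collator implements Comparator<Object>, so it can drive the sort
    // directly and yields the {a,á,b,c,ç,z} character ordering.
    Collator collator = Collator.getInstance(new Locale("pt", "BR"));
    Arrays.sort(paths, collator);
    for (String p : paths) {
      System.out.println(p);
    }
  }
}
{code}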


> Enhance the webUi to list a few of the corrupted files in HDFS
> --
>
> Key: HDFS-1031
> URL: https://issues.apache.org/jira/browse/HDFS-1031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: André Oriani
> Attachments: hdfs-1031_aoriani.patch, hdfs-1031_aoriani_2.patch, 
> hdfs-1031_aoriani_3.patch
>
>
> The existing webUI displays something like this:
> WARNING : There are about 12 missing blocks. Please check the log or run 
> fsck. 
> It would be nice if we can display the filenames that have missing blocks. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.