[jira] Commented: (HDFS-1084) TestDFSShell fails in trunk.

2010-08-30 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904249#action_12904249
 ] 

Konstantin Shvachko commented on HDFS-1084:
---

Two tests failed: testFilePermissions() with the same message, and 
testErrOutPut():
{code}
Testcase: testErrOutPut took 12.357 sec
FAILED
cat does not print exceptions 
junit.framework.AssertionFailedError: cat does not print exceptions 
at 
org.apache.hadoop.hdfs.TestDFSShell.testErrOutPut(TestDFSShell.java:325)
{code}

 TestDFSShell fails in trunk.
 

 Key: HDFS-1084
 URL: https://issues.apache.org/jira/browse/HDFS-1084
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 0.22.0
Reporter: Konstantin Shvachko
 Fix For: 0.22.0


 {{TestDFSShell.testFilePermissions()}} fails on an assert attached below. I 
 see it on my Linux box. I don't see it failing on Hudson, and the same test 
 runs fine in the 0.21 branch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1152) appendFile does not recheck lease in second synchronized block

2010-08-30 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904280#action_12904280
 ] 

Konstantin Shvachko commented on HDFS-1152:
---

Todd, it seems to me that the two synchronized sections should actually be one. 
There is some redundant work done in the second section, and the consistency of 
actions is also not clear to me, because if you release the lock in the middle 
then you basically need to verify everything again, since the lease may expire 
or permissions may change. It looks to me that we should either move the second 
section inside startFileInternal() or remove it if it is redundant code. 
Let's target this jira for 0.20-append. I'll create another one for trunk and 
will take a closer look at the problem.
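
To make the hazard concrete, here is a minimal, self-contained sketch of the 
pattern under discussion; the class and method names are illustrative stand-ins, 
not the real FSNamesystem/LeaseManager code.
{code}
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: shows why a second synchronized section must re-verify
// the lease when the lock was dropped between the two sections.
class AppendLeaseSketch {
  private final Map<String, String> leaseHolderByPath =
      new ConcurrentHashMap<String, String>();

  synchronized void prepareAppend(String path, String holder) throws IOException {
    checkLease(path, holder);   // first synchronized section: verify and set up
  }

  synchronized void finishAppend(String path, String holder) throws IOException {
    // The lock was released in between: the lease may have expired or been
    // reassigned, so recheck before touching the file again.
    checkLease(path, holder);
    // ... continue with pipeline setup / returning the last partial block
  }

  private void checkLease(String path, String holder) throws IOException {
    if (!holder.equals(leaseHolderByPath.get(path))) {
      throw new IOException("Lease on " + path + " is not held by " + holder);
    }
  }
}
{code}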

 appendFile does not recheck lease in second synchronized block
 --

 Key: HDFS-1152
 URL: https://issues.apache.org/jira/browse/HDFS-1152
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
 Attachments: appendFile-recheck-lease.txt, hdfs-1152.txt


 FSN.appendFile is made up of two synchronized sections. The second section 
 assumes that the file has not gotten modified during the unsynchronized part 
 in between. We should recheck the lease in the second block.




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904347#action_12904347
 ] 

Konstantin Shvachko commented on HDFS-1111:
---

Patch review comments:

FSNamesystem
# CorruptFileBlockInfo should be a static non-public class.
# I'd rather make it store the file path and the block itself, rather than 
its name. The block name can be constructed in Fsck during printing. 
{code}
static class CorruptFileBlockInfo {
  String path;
  Block block;

  public String toString() {
    return block.getBlockName() + "\t" + path;
  }
}
{code}
The method may be used in other tools, as Dhruba stated, so it is better to 
keep the original structures.
# The listCorruptFileBlocks() methods should return 
{{Collection<CorruptFileBlockInfo>}} rather than {{CorruptFileBlockInfo[]}}. 
This saves converting to an array; see the sketch below.
# NamenodeFsck unused imports: FileStatus, Path
# TestFileCorruption unused imports: FileStatus, ClientProtocol
# DFSck unused imports: DFSConfigKeys
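
A minimal sketch of how the points above could fit together; field visibility, 
the constructor, and the enclosing class are assumptions, not the committed code.
{code}
import java.util.ArrayList;
import java.util.Collection;
import org.apache.hadoop.hdfs.protocol.Block;

class CorruptFileBlocksSketch {
  static class CorruptFileBlockInfo {
    final String path;
    final Block block;

    CorruptFileBlockInfo(String path, Block block) {
      this.path = path;
      this.block = block;
    }

    @Override
    public String toString() {
      // the block name is rendered only at print time (by fsck or another tool)
      return block.getBlockName() + "\t" + path;
    }
  }

  // Returning a Collection avoids the extra copy into an array (point 3 above).
  Collection<CorruptFileBlockInfo> listCorruptFileBlocks() {
    return new ArrayList<CorruptFileBlockInfo>();
  }
}
{code}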

 getCorruptFiles() should give some hint that the list is not complete
 -

 Key: HDFS-1111
 URL: https://issues.apache.org/jira/browse/HDFS-1111
 Project: Hadoop HDFS
  Issue Type: New Feature
Affects Versions: 0.22.0
Reporter: Rodrigo Schmidt
Assignee: Sriram Rao
 Fix For: 0.22.0

 Attachments: HADFS-1111.0.patch, HDFS-1111-y20.1.patch, 
 HDFS-1111-y20.2.patch, HDFS-1111.trunk.patch


 The list of corrupt files returned by the namenode doesn't say anything if 
 the number of corrupted files is larger than the call output limit (which 
 means the list is not complete). There should be a way to hint incompleteness 
 to clients.
 A simple hack would be to add an extra entry to the array returned with the 
 value null. Clients could interpret this as a sign that there are other 
 corrupt files in the system.
 We should also do some rephrasing of the fsck output to make it more 
 confident when the list is complete and less confident when the list is 
 known to be incomplete.
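
A tiny sketch of the null-entry hint suggested above; the String[] shape is an 
assumption for illustration, not the actual getCorruptFiles() signature.
{code}
class CorruptListHintSketch {
  // A trailing null entry would signal that the server hit its output limit.
  static boolean isTruncated(String[] corruptFiles) {
    return corruptFiles.length > 0 && corruptFiles[corruptFiles.length - 1] == null;
  }

  public static void main(String[] args) {
    String[] fromNameNode = {"/user/a/part-0001", "/user/b/part-0007", null};
    System.out.println("more corrupt files exist: " + isTruncated(fromNameNode));
  }
}
{code}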




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904359#action_12904359
 ] 

Konstantin Shvachko commented on HDFS-1111:
---

Dhruba, I'd be glad to discuss how to make corrupted blocks available to 
RaidNode in a subsequent jira.
Currently the method in ClientProtocol lacks the functionality stated in this 
jira, and does not conform to the design proposed by Sriram. So it has to be 
replaced by another method or removed. As it is not used anywhere in the code, 
I'd prefer that it be removed from the protocol until the exact API is 
developed. It looks like this effort should be coordinated with MAPREDUCE-2036 
so that both RAID solutions can benefit from it.
Would you agree?




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904363#action_12904363
 ] 

dhruba borthakur commented on HDFS-1111:


The patch posted by Sriram looks good; however, I would add the new API 
listCorruptFilesAndBlocks to ClientProtocol so that tools can use it.




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904366#action_12904366
 ] 

Konstantin Shvachko commented on HDFS-1111:
---

Tools cannot use ClientProtocol directly as discussed above. More changes will 
be required to make it usable. Can it be done in another jira?




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904375#action_12904375
 ] 

dhruba borthakur commented on HDFS-1111:


 More changes will be required to make it usable

What are those? Can you please list them (again)? Thanks.




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904403#action_12904403
 ] 

Konstantin Shvachko commented on HDFS-1111:
---

These two comments by 
[Sanjay|https://issues.apache.org/jira/browse/HDFS-1111?focusedCommentId=12883369page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12883369]
 and 
[me|https://issues.apache.org/jira/browse/HDFS-1111?focusedCommentId=12893457page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12893457]
 should summarize the discussion about the ClientProtocol changes.
ClientProtocol, DFSClient, DistributedFileSystem, FileSystem, FileContext... - 
these classes may be affected by the changes. 
The main problems are that
- there is no clear idea / description of how this API will be exposed to 
RaidNode,
- therefore, it is not clear what the scope of the changes is,
- the changes intended for RaidNode are not related to Fsck, and shouldn't be 
mixed in with this functionality.

You say tools (plural); do you have tools other than RaidNode in mind?




[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-08-30 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904416#action_12904416
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-1057:
--

If there is no follow-up on the test failure, I suggest that we should first 
revert the committed patch and then fix the problem here.

 Concurrent readers hit ChecksumExceptions if following a writer to very end 
 of file
 ---

 Key: HDFS-1057
 URL: https://issues.apache.org/jira/browse/HDFS-1057
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node
Affects Versions: 0.20-append, 0.21.0, 0.22.0
Reporter: Todd Lipcon
Assignee: sam rash
Priority: Blocker
 Fix For: 0.20-append, 0.21.0, 0.22.0

 Attachments: conurrent-reader-patch-1.txt, 
 conurrent-reader-patch-2.txt, conurrent-reader-patch-3.txt, 
 HDFS-1057-0.20-append.patch, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt, 
 hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt, 
 hdfs-1057-trunk-6.txt


 In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before 
 calling flush(). Therefore, if there is a concurrent reader, it's possible to 
 race here - the reader will see the new length while those bytes are still in 
 the buffers of BlockReceiver. Thus the client will potentially see checksum 
 errors or EOFs. Additionally, the last checksum chunk of the file is made 
 accessible to readers even though it is not stable.
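
A simplified sketch of the ordering issue described above; the class and fields 
are stand-ins, not the actual BlockReceiver/ReplicaInPipeline code.
{code}
import java.io.IOException;
import java.io.OutputStream;

// Stand-in for the datanode write path: the visible length must only advance
// after the bytes have been flushed, otherwise a concurrent reader can read
// past what is actually on disk and hit checksum errors or EOFs.
class PacketWriteSketch {
  private final OutputStream blockOut;
  private volatile long visibleLength;   // what a concurrent reader may read up to

  PacketWriteSketch(OutputStream blockOut) {
    this.blockOut = blockOut;
  }

  void receivePacket(byte[] packet) throws IOException {
    blockOut.write(packet);
    blockOut.flush();                    // flush first ...
    visibleLength += packet.length;      // ... then advertise the new length
    // (the bug: advertising the length before flush() lets readers win the race)
  }
}
{code}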




[jira] Resolved: (HDFS-1351) Make it possible for BlockPlacementPolicy to return null

2010-08-30 Thread Dmytro Molkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmytro Molkov resolved HDFS-1351.
-

Resolution: Invalid

Sorry, after talking with Hairong I realized that it will not be possible to 
make the fix this easy. The reason the return value cannot be null is that at 
this point the NameNode knows it has to delete an extra replica of the block. 
If we skip this deletion, it will not know the block has extra replicas until 
the next time it does a full rescan (on restart). So this jira is itself 
invalid.

 Make it possible for BlockPlacementPolicy to return null
 

 Key: HDFS-1351
 URL: https://issues.apache.org/jira/browse/HDFS-1351
 Project: Hadoop HDFS
  Issue Type: Test
  Components: name-node
Affects Versions: 0.22.0
Reporter: Dmytro Molkov
Assignee: Dmytro Molkov
 Attachments: HDFS-1351.patch


 The idea is to modify the FSNamesystem.chooseExcessReplicates code so it can 
 accept a null return from chooseReplicaToDelete, which will indicate that the 
 NameNode should not delete extra replicas.
 One possible use case: if nodes being added to the cluster might have corrupt 
 replicas on them, you do not want to delete other replicas until the block 
 scanner has finished scanning every block on the datanode.
 This will require additional work on the implementation of the 
 BlockPlacementPolicy, but with this JIRA I just wanted to create a basis for 
 future improvements.
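
For clarity, a sketch of the change the description proposes (which, per the 
resolution above, turned out to be unworkable); the types are simplified 
stand-ins, not the real BlockPlacementPolicy/FSNamesystem API.
{code}
import java.util.Collection;

interface PlacementPolicySketch<D> {
  // may return null to mean "do not delete any replica right now"
  D chooseReplicaToDelete(Collection<D> candidates);
}

class ExcessReplicaSketch {
  static <D> void chooseExcessReplicates(Collection<D> replicas,
                                         PlacementPolicySketch<D> policy) {
    D victim = policy.chooseReplicaToDelete(replicas);
    if (victim == null) {
      // Skip the deletion; as the resolution above notes, the NameNode would
      // then forget the block is over-replicated until the next full rescan.
      return;
    }
    replicas.remove(victim);
  }
}
{code}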




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Ramkumar Vadali (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904431#action_12904431
 ] 

Ramkumar Vadali commented on HDFS-1111:
---

The RaidNode use case at a high level is to identify corrupted data that can be 
fixed by using parity data.

This can be achieved by:
 1. Getting a list of corrupt files and subsequently identifying the corrupt 
blocks in each corrupt file. The current getCorruptFiles() RPC enables getting 
the list of corrupt files. 
-OR-
 2. Getting a list of corrupt files annotated by the corrupt blocks. If this 
patch introduced an RPC with that functionality, it would be an improvement over 
the getCorruptFiles() RPC. 

I have a patch for https://issues.apache.org/jira/browse/HDFS-1171 that depends 
on the getCorruptFiles() RPC, so removal of that RPC with no substitute would 
mean loss of functionality.
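
A rough sketch of approach (1), with hypothetical method names (getCorruptFiles(), 
getCorruptBlocks()) standing in for whatever RPC shape is eventually agreed on.
{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class RaidFixCandidatesSketch {
  // Hypothetical source of corruption info; not the actual ClientProtocol methods.
  interface CorruptionSource {
    String[] getCorruptFiles();
    Collection<Long> getCorruptBlocks(String path);
  }

  // List corrupt files first, then keep only the raided files that actually have
  // corrupt blocks, i.e. the candidates the RaidNode could fix from parity data.
  static List<String> filesFixableWithParity(CorruptionSource nn,
                                             Collection<String> raidedFiles) {
    List<String> candidates = new ArrayList<String>();
    for (String path : nn.getCorruptFiles()) {
      if (raidedFiles.contains(path) && !nn.getCorruptBlocks(path).isEmpty()) {
        candidates.add(path);
      }
    }
    return candidates;
  }
}
{code}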




[jira] Updated: (HDFS-1353) Optimize number of block access tokens returned by getBlockLocations

2010-08-30 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan updated HDFS-1353:
--

Fix Version/s: 0.21.1
   (was: 0.22.0)
Affects Version/s: 0.21.0
   (was: 0.22.0)

 Optimize number of block access tokens returned by getBlockLocations
 

 Key: HDFS-1353
 URL: https://issues.apache.org/jira/browse/HDFS-1353
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 0.21.0
Reporter: Jakob Homan
Assignee: Jakob Homan
 Fix For: 0.21.1


 HDFS-1081 optimized the number of block access tokens (BATs) created in a 
 single call to getBlockLocations, as this is an expensive operation.  
 However, that JIRA put off another optimization which it made possible: 
 sending just a single block access token across the wire (and maintaining a 
 single BAT on the client side).  This JIRA is for implementing that 
 optimization.  Since a single BAT is generated for all the blocks, we just 
 write that single BAT to the wire, rather than writing n BATs for n blocks, 
 as is currently done.  This turns out to be a useful optimization for files 
 with very large numbers of blocks, as the new lone BAT is much larger than a 
 single BAT was previously.
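
A generic sketch of the wire-level change being described; the types are 
stand-ins, not the actual LocatedBlocks or block access token classes.
{code}
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;

class TokenSerializationSketch {
  interface WireWritable { void write(DataOutput out) throws IOException; }

  // Current behavior: one token per block, so n tokens cross the wire for n blocks.
  static void writePerBlockTokens(DataOutput out, List<WireWritable> blocks,
                                  List<WireWritable> tokens) throws IOException {
    for (int i = 0; i < blocks.size(); i++) {
      blocks.get(i).write(out);
      tokens.get(i).write(out);
    }
  }

  // Proposed optimization: the single shared token is written once, then only the blocks.
  static void writeSharedToken(DataOutput out, List<WireWritable> blocks,
                               WireWritable sharedToken) throws IOException {
    sharedToken.write(out);
    for (WireWritable block : blocks) {
      block.write(out);
    }
  }
}
{code}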




[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-08-30 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1290#action_1290
 ] 

dhruba borthakur commented on HDFS-1057:


I agree with Nicholas; in the worst case, shall we get rid of 
TestFileConcurrentReader itself?




[jira] Commented: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904449#action_12904449
 ] 

Konstantin Shvachko commented on HDFS-1111:
---

I don't see the patch. And I don't know how the RPC *alone* can enable the 
functionality. I think this discussion should go on within HDFS-1171, where you 
can make a case, as Sanjay suggested, for introducing the RPC, which has not 
been done yet. What is wrong with adding the call in HDFS-1171?




[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-08-30 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904455#action_12904455
 ] 

sam rash commented on HDFS-1057:


I'm confused--the jira for the test result indicated you had solved the 
problem.  Can you let me know what you need me to do?





[jira] Updated: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Sriram Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriram Rao updated HDFS-1111:
-

Attachment: HDFS-1111.trunk.1.patch

Attached is an updated patch that addresses all the issues that Konstantin 
pointed out.  Thanks, Konstantin.

I also refactored the tests in TestFileCorruption.java that were related to 
this issue (listCorruptFileBlocks) and made them a separate test 
(TestListCorruptFileBlocks.java) that tests this feature.




[jira] Updated: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Sriram Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriram Rao updated HDFS-1111:
-

Attachment: (was: HDFS-1111.trunk.patch)




[jira] Updated: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Sriram Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriram Rao updated HDFS-1111:
-

Attachment: HDFS-1111.trunk.patch

Re-attaching the original patch.  Deleted the wrong one.




[jira] Updated: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Sriram Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriram Rao updated HDFS-1111:
-

Attachment: (was: HDFS-1111.trunk.1.patch)




[jira] Updated: (HDFS-1111) getCorruptFiles() should give some hint that the list is not complete

2010-08-30 Thread Sriram Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriram Rao updated HDFS-1111:
-

Attachment: HDFS-1111.trunk.1.patch

Re-attaching the patch that addresses the comments from Konstantin and 
includes the test case that was added for this issue.




[jira] Assigned: (HDFS-1310) TestFileConcurrentReader fails

2010-08-30 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash reassigned HDFS-1310:
--

Assignee: sam rash

 TestFileConcurrentReader fails
 --

 Key: HDFS-1310
 URL: https://issues.apache.org/jira/browse/HDFS-1310
 Project: Hadoop HDFS
  Issue Type: Test
Affects Versions: 0.22.0
Reporter: Suresh Srinivas
Assignee: sam rash

 For details of test failure see 
 http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/218/testReport/




[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails

2010-08-30 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904472#action_12904472
 ] 

sam rash commented on HDFS-1310:


Sorry for the delay; I skimmed this jira and the last comment contradicted the 
title, so I assumed it was in OK shape.
I will have a minute to look at this in more detail tomorrow night.





[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails

2010-08-30 Thread Tanping Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904483#action_12904483
 ] 

Tanping Wang commented on HDFS-1310:


Hi Sam, sorry about the confusion.  In my last comment, I mentioned that I ran 
each test case (out of the total seven tests in TestFileConcurrentReader) 
one by one, *individually*, i.e. commenting the other six tests out and 
leaving only one test running each time.  Each single test passed.  However, 
if I run the seven tests from TestFileConcurrentReader together, the last 
three (sometimes two) tests fail.
The last three tests are testUnfinishedBlockCRCErrorTransferToVerySmallWrite, 
testUnfinishedBlockCRCErrorNormalTransfer and 
testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite.

Please look at the Hudson results for reference:

https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorTransferToVerySmallWrite/
https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransfer/
https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite/

The first test case failed because of "java.io.IOException: Too many open 
files".
The next two (sometimes one) tests fail due to "Cannot lock storage ... The 
directory is already locked".
It seems to me that the test runs into a race condition and does not release 
resources properly.
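
A sketch of the teardown pattern these failures point to: shut the MiniDFSCluster 
down even when a test dies, so the storage-directory lock and any open streams are 
released before the next test starts. The setup shown is illustrative, not the 
actual TestFileConcurrentReader code.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class ClusterCleanupSketch {
  public static void main(String[] args) throws Exception {
    MiniDFSCluster cluster = new MiniDFSCluster(new Configuration(), 1, true, (String[]) null);
    try {
      // ... run the test scenario against cluster.getFileSystem()
    } finally {
      // without this, the next test can fail with "The directory is already locked"
      cluster.shutdown();
    }
  }
}
{code}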






[jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file

2010-08-30 Thread Tanping Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904486#action_12904486
 ] 

Tanping Wang commented on HDFS-1057:


Hi Sam, I have just commented on HDFS-1310.  Would you please look at the 
Hudson results for reference:

https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorTransferToVerySmallWrite/
https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransfer/
https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite/
Thanks!




[jira] Commented: (HDFS-1310) TestFileConcurrentReader fails

2010-08-30 Thread sam rash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904506#action_12904506
 ] 

sam rash commented on HDFS-1310:


Actually, the second two are likely fallout from the first--if it died and 
didn't clean up the locks, this could happen.

As I noted, I'm a bit short on time tonight, so I'll get to this tomorrow 
evening.  FWIW, this looks familiar--the 'too many open files' with this unit 
test.  I thought I had already seen this and fixed it, where I simply didn't 
close a file in a thread... maybe I only patched it in our local branch.

Thanks for the direct links to the results.
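
A minimal sketch of the kind of fix described: close the per-iteration stream in 
the reader thread so a long-running tailer cannot exhaust file descriptors. The 
surrounding structure is assumed, not quoted from the actual test.
{code}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class TailReaderSketch implements Runnable {
  private final FileSystem fs;
  private final Path file;

  TailReaderSketch(FileSystem fs, Path file) {
    this.fs = fs;
    this.file = file;
  }

  @Override
  public void run() {
    try {
      InputStream in = fs.open(file);
      try {
        while (in.read() != -1) {
          // follow the writer to the end of the currently visible data
        }
      } finally {
        in.close();   // the missing close() is what leads to "Too many open files"
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
{code}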
