[jira] [Commented] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178044#comment-14178044 ] Plamen Jeliazkov commented on HDFS-3107: [~srivas], There is no plan to grow the file by padding it with zeroes as general-purpose truncate does. Both [~shv] and [~lei_chang] mentioned this in their design docs, I believe. [~cmccabe], While copying the last block up to its truncate point and doing a delete/concat is definitely a simpler overall approach, the full truncate implementation has the benefit of being a single NameNode RPC call that can both truncate in-place and copy-on-truncate, preserving the original last block and moving the 'copy&truncate' work to the DataNodes themselves (as opposed to having to pass data through the network / client). I am not intending to debate either implementation -- I like both personally; I just wanted to explain as briefly as I could why Konstantin and I are taking our approach. > HDFS truncate > - > > Key: HDFS-3107 > URL: https://issues.apache.org/jira/browse/HDFS-3107 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Reporter: Lei Chang >Assignee: Plamen Jeliazkov > Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, > HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, > editsStored, editsStored, editsStored.xml > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > Systems with transaction support often need to undo changes made to the > underlying storage when a transaction is aborted. 
Currently HDFS does not > support truncate (a standard Posix operation) which is a reverse operation of > append, which makes upper layer applications use ugly workarounds (such as > keeping track of the discarded byte range per file in a separate metadata > store, and periodically running a vacuum process to rewrite compacted files) > to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
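The copy-and-concat alternative Plamen contrasts against can be illustrated on a local filesystem: rewrite the shortened tail into a fresh file, then splice it back in place of the original. This is only a stdlib analogy of the idea (java.nio standing in for the HDFS client API and concat call), not the actual patch:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import static java.nio.file.StandardOpenOption.*;

public class CopyTruncate {
    // Copy the first newLength bytes of src into a fresh file, then replace
    // src with it -- the local analogue of "write a shorter last block, then
    // concat it onto the untouched earlier blocks".
    public static void copyTruncate(Path src, long newLength) throws IOException {
        Path tmp = src.resolveSibling(src.getFileName() + ".truncated");
        try (FileChannel in = FileChannel.open(src, READ);
             FileChannel out = FileChannel.open(tmp, CREATE, WRITE, TRUNCATE_EXISTING)) {
            in.transferTo(0, newLength, out);
        }
        Files.move(tmp, src, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("trunc", ".dat");
        Files.write(f, "0123456789".getBytes());
        copyTruncate(f, 4);
        System.out.println(Files.size(f)); // 4
    }
}
```

Note the trade-off the comment describes: in this scheme the shortened data passes through the copier, whereas the in-place truncate keeps that work on the DataNodes.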
[jira] [Commented] (HDFS-7254) Add documents for hot swap drive
[ https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178009#comment-14178009 ] Fengdong Yu commented on HDFS-7254: --- bq.<<>> should be dfs.datanode.data.dir > Add documents for hot swap drive > > > Key: HDFS-7254 > URL: https://issues.apache.org/jira/browse/HDFS-7254 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 2.5.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch > > > Add documents for the hot swap drive functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7056) Snapshot support for truncate
[ https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guo Ruijing updated HDFS-7056: -- Attachment: HDFSSnapshotWithTruncateDesign.docx Attached HDFS Snapshot With Truncate Design for reference/review. > Snapshot support for truncate > - > > Key: HDFS-7056 > URL: https://issues.apache.org/jira/browse/HDFS-7056 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.0.0 >Reporter: Konstantin Shvachko > Attachments: HDFSSnapshotWithTruncateDesign.docx > > > Implementation of truncate in HDFS-3107 does not allow truncating files which > are in a snapshot. It is desirable to be able to truncate and still keep the > old file state of the file in the snapshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7225) Failed DataNode lookup can crash NameNode with NullPointerException
[ https://issues.apache.org/jira/browse/HDFS-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177948#comment-14177948 ] Zhe Zhang commented on HDFS-7225: - [~andrew.wang] Thanks for the suggestion. I think that's a good idea. It assumes that the NN will make the same decision to invalidate those blocks when the volume is back. I think it's a valid assumption. I'll implement that option. > Failed DataNode lookup can crash NameNode with NullPointerException > --- > > Key: HDFS-7225 > URL: https://issues.apache.org/jira/browse/HDFS-7225 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7225-v1.patch > > > {{BlockManager#invalidateWorkForOneNode}} looks up a DataNode by the > {{datanodeUuid}} and passes the resultant {{DatanodeDescriptor}} to > {{InvalidateBlocks#invalidateWork}}. However, if a wrong or outdated > {{datanodeUuid}} is used, a null pointer will be passed to {{invalidateWork}} > which will use it to lookup in a {{TreeMap}}. Since the key type is > {{DatanodeDescriptor}}, key comparison is based on the IP address. A null key > will crash the NameNode with an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
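The failure mode described above is easy to reproduce with a plain {{TreeMap}} (the names below are illustrative stand-ins, not the actual BlockManager types): a lookup with a null key throws NullPointerException because the map tries to compare against it.

```java
import java.util.TreeMap;

public class NullKeyLookup {
    // Stand-in for the DataNode lookup that fails for a stale datanodeUuid.
    static String lookupDatanode(String uuid) {
        return null;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> byNode = new TreeMap<>();
        byNode.put("dn-1", 3);
        // A stale datanodeUuid resolves to null; passing that null straight
        // on as the TreeMap key is what crashes:
        String descriptor = lookupDatanode("stale-uuid");
        try {
            byNode.get(descriptor); // TreeMap rejects null keys under natural ordering
        } catch (NullPointerException e) {
            System.out.println("NPE on null key, as in the reported crash");
        }
    }
}
```

The fix direction discussed in the thread is to null-check the lookup result before it ever reaches the map.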
[jira] [Commented] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone
[ https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177941#comment-14177941 ] Yi Liu commented on HDFS-7243: -- Hi Charles, I was going to commit this patch just now, but found another issue; sorry for missing that in my previous comments. {code} dir.getINodesInPath4Write(target, true); {code} we should call {code} dir.getINodesInPath4Write(target); {code} since the latter will hold the FsDir read lock. Besides, another small nit in the test: {code} fs.concat(new Path(ez, "target"), new Path[] { src1, src2 }); {code} We could use {{target}} instead of {{new Path(ez, "target")}} > HDFS concat operation should not be allowed in Encryption Zone > -- > > Key: HDFS-7243 > URL: https://issues.apache.org/jira/browse/HDFS-7243 > Project: Hadoop HDFS > Issue Type: Bug > Components: encryption, namenode >Affects Versions: 2.6.0 >Reporter: Yi Liu >Assignee: Charles Lamb > Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch, > HDFS-7243.003.patch > > > For HDFS encryption at rest, files in an encryption zone are using different > data encryption keys, so concat should be disallowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message
[ https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177942#comment-14177942 ] Hadoop QA commented on HDFS-3342: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676008/HDFS-3342.002.patch against trunk revision 7aab5fa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestDecommission org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8467//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8467//console This message is automatically generated. 
> SocketTimeoutException in BlockSender.sendChunks could have a better error > message > -- > > Key: HDFS-3342 > URL: https://issues.apache.org/jira/browse/HDFS-3342 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.0.0-alpha >Reporter: Todd Lipcon >Assignee: Yongjun Zhang >Priority: Minor > Labels: supportability > Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch, > HDFS-3342.002.patch > > > Currently, if a client connects to a DN and begins to read a block, but then > stops calling read() for a long period of time, the DN will log a > SocketTimeoutException "48 millis timeout while waiting for channel to be > ready for write." This is because there is no "keepalive" functionality of > any kind. At a minimum, we should improve this error message to be an INFO > level log which just says that the client likely stopped reading, so > disconnecting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
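The improvement Todd asks for amounts to catching the timeout and logging a one-line INFO explanation instead of the raw exception. A minimal stdlib sketch of the pattern (the message text is illustrative, not the patched BlockSender code; the server socket merely manufactures a SocketTimeoutException for the demo):

```java
import java.net.ServerSocket;
import java.net.SocketTimeoutException;

public class FriendlyTimeout {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            server.setSoTimeout(50); // expire quickly for the demo
            try {
                server.accept(); // nobody connects, so this times out
            } catch (SocketTimeoutException e) {
                // Instead of surfacing the bare timeout, log an INFO line
                // saying the peer likely stopped reading, then disconnect it:
                System.out.println("INFO: client appears to have stopped reading;"
                    + " disconnecting (" + e.getMessage() + ")");
            }
        }
    }
}
```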
[jira] [Commented] (HDFS-7269) NN and DN don't check whether corrupted blocks reported by clients are actually corrupted
[ https://issues.apache.org/jira/browse/HDFS-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177905#comment-14177905 ] Ming Ma commented on HDFS-7269: --- Nicholas, in our case, the client only reported one replica for each reportBadBlocks call. But given there were multiple DFSInputStream read calls for a given block and each read call could mark one replica bad, all replicas were marked as bad. > NN and DN don't check whether corrupted blocks reported by clients are > actually corrupted > - > > Key: HDFS-7269 > URL: https://issues.apache.org/jira/browse/HDFS-7269 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had a case where the client machine had memory issue and thus failed the > checksum validation of a given block for all its replicas. So the client > ended up informing NN about the corrupted blocks for all DNs via > reportBadBlocks. However, the block isn't corrupted on any of the DNs. You > can still use DFSClient to read the block. But in order to get rid of NN's > warning message for corrupt block, we either do a NN fail over, or repair the > file via a) copy the file somewhere, b) remove the file, c) copy the file > back. > It will be useful if NN and DN can validate client's report. In fact, there > is a comment in NamenodeRpcServer about this. > {noformat} > /** >* The client has detected an error on the specified located blocks >* and is reporting them to the server. For now, the namenode will >* mark the block as corrupt. In the future we might >* check the blocks are actually corrupt. >*/ > {noformat} > To allow system to recover from invalid client report quickly, we can support > automatic recovery or manual admins command. > 1. we can have NN send a new DatanodeCommand like ValidateBlockCommand. DN > will notify the validate result via IBR and new > ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK. > 2. 
Some new admins command to move corrupted blocks out of BM's > CorruptReplicasMap and UnderReplicatedBlocks. > Appreciate any input. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7271) Find a way to make encryption zone deletion work with HDFS trash.
[ https://issues.apache.org/jira/browse/HDFS-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu resolved HDFS-7271. -- Resolution: Invalid The "-skipTrash" flag already exists for the rm op, so I am resolving this as invalid. > Find a way to make encryption zone deletion work with HDFS trash. > - > > Key: HDFS-7271 > URL: https://issues.apache.org/jira/browse/HDFS-7271 > Project: Hadoop HDFS > Issue Type: Bug > Components: encryption >Affects Versions: 2.6.0 >Reporter: Yi Liu >Assignee: Yi Liu > > Currently when HDFS trash is enabled, deletion of encryption zone will have > issue: > {quote} > rmr: Failed to move to trash: ... can't be moved from an encryption zone. > {quote} > A simple way is to add ignore trash flag for fs rm operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177900#comment-14177900 ] M. C. Srivas commented on HDFS-3107: Note that a general-purpose truncate can be used to also *increase* the size of the file. It is used very often, for example, to implement a database, growing the file when it isn't large enough. Are you planning to implement truncate to behave that way too? > HDFS truncate > - > > Key: HDFS-3107 > URL: https://issues.apache.org/jira/browse/HDFS-3107 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Reporter: Lei Chang >Assignee: Plamen Jeliazkov > Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, > HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, > editsStored, editsStored, editsStored.xml > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > Systems with transaction support often need to undo changes made to the > underlying storage when a transaction is aborted. Currently HDFS does not > support truncate (a standard Posix operation) which is a reverse operation of > append, which makes upper layer applications use ugly workarounds (such as > keeping track of the discarded byte range per file in a separate metadata > store, and periodically running a vacuum process to rewrite compacted files) > to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
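For reference, the grow-the-file semantics Srivas describes can be seen with {{RandomAccessFile.setLength}}, the JDK analogue of POSIX ftruncate: setting a length larger than the file extends it (on common filesystems the new region reads back as zeros, though the JDK leaves the extended contents formally unspecified). This is what HDFS-3107 is explicitly *not* implementing:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class GrowByTruncate {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("grow", ".dat");
        try (RandomAccessFile raf = new RandomAccessFile(f.toFile(), "rw")) {
            raf.write(new byte[]{1, 2, 3, 4}); // file is 4 bytes long
            // "truncate" UPWARD: the file grows to 10 bytes, the tail
            // typically reading back as zero padding
            raf.setLength(10);
        }
        System.out.println(Files.size(f)); // 10
    }
}
```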
[jira] [Created] (HDFS-7271) Find a way to make encryption zone deletion work with HDFS trash.
Yi Liu created HDFS-7271: Summary: Find a way to make encryption zone deletion work with HDFS trash. Key: HDFS-7271 URL: https://issues.apache.org/jira/browse/HDFS-7271 Project: Hadoop HDFS Issue Type: Bug Components: encryption Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Currently when HDFS trash is enabled, deletion of encryption zone will have issue: {quote} rmr: Failed to move to trash: ... can't be moved from an encryption zone. {quote} A simple way is to add ignore trash flag for fs rm operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7254) Add documents for hot swap drive
[ https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177888#comment-14177888 ] Hadoop QA commented on HDFS-7254: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675986/HDFS-7254.001.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.tools.TestDFSAdminWithHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8465//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8465//console This message is automatically generated. 
> Add documents for hot swap drive > > > Key: HDFS-7254 > URL: https://issues.apache.org/jira/browse/HDFS-7254 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 2.5.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch > > > Add documents for the hot swap drive functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177885#comment-14177885 ] Gopal V commented on HDFS-7266: --- That was quick! Thanks [~cmccabe] & [~andrew.wang]. > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.0 >Reporter: Gopal V >Assignee: Andrew Wang >Priority: Minor > Labels: multi-threading > Fix For: 2.7.0 > > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is final, this check could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
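The fix here is the classic pattern of hoisting a check on an immutable field out of the synchronized block. A stand-alone sketch of the before/after (PeerCacheSketch and its members are illustrative, not the patched hdfs-client class):

```java
public class PeerCacheSketch {
    private final int capacity; // final: safe to read without holding the lock

    public PeerCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    // Before: every caller contends on the monitor just to discover
    // that the cache is disabled.
    public synchronized Object getLocked(String dnId) {
        if (capacity <= 0) {
            return null;
        }
        return lookup(dnId);
    }

    // After: the disabled-cache fast path is lock-free, which is sound
    // because a final field never changes after construction.
    public Object get(String dnId) {
        if (capacity <= 0) {
            return null;
        }
        synchronized (this) {
            return lookup(dnId);
        }
    }

    private Object lookup(String dnId) {
        return null; // cache internals elided in this sketch
    }
}
```

Under many threads opening files concurrently, only the enabled-cache path ever takes the lock, which is what the attached thread profile was showing.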
[jira] [Updated] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3107: --- Attachment: (was: HDFS-3107.008.patch) > HDFS truncate > - > > Key: HDFS-3107 > URL: https://issues.apache.org/jira/browse/HDFS-3107 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Reporter: Lei Chang >Assignee: Plamen Jeliazkov > Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, > HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, > editsStored, editsStored, editsStored.xml > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > Systems with transaction support often need to undo changes made to the > underlying storage when a transaction is aborted. Currently HDFS does not > support truncate (a standard Posix operation) which is a reverse operation of > append, which makes upper layer applications use ugly workarounds (such as > keeping track of the discarded byte range per file in a separate metadata > store, and periodically running a vacuum process to rewrite compacted files) > to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3107: --- Attachment: HDFS-3107.008.patch fix log message which should be trace, not info > HDFS truncate > - > > Key: HDFS-3107 > URL: https://issues.apache.org/jira/browse/HDFS-3107 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Reporter: Lei Chang >Assignee: Plamen Jeliazkov > Attachments: HDFS-3107.008.patch, HDFS-3107.008.patch, > HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS_truncate.pdf, > HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, > HDFS_truncate_semantics_Mar21.pdf, editsStored, editsStored, editsStored.xml > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > Systems with transaction support often need to undo changes made to the > underlying storage when a transaction is aborted. Currently HDFS does not > support truncate (a standard Posix operation) which is a reverse operation of > append, which makes upper layer applications use ugly workarounds (such as > keeping track of the discarded byte range per file in a separate metadata > store, and periodically running a vacuum process to rewrite compacted files) > to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3107: --- Attachment: HDFS-3107.008.patch Hi all, Here's a patch which implements truncate in such a way that it works with snapshots. This doesn't modify the last replica file of the truncated file in place. Instead, it writes out a new file with the new (shorter) contents of the last replica file, and uses concat to combine it with the first part of the file. > HDFS truncate > - > > Key: HDFS-3107 > URL: https://issues.apache.org/jira/browse/HDFS-3107 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Reporter: Lei Chang >Assignee: Plamen Jeliazkov > Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, > HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, > HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, > editsStored, editsStored, editsStored.xml > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > Systems with transaction support often need to undo changes made to the > underlying storage when a transaction is aborted. Currently HDFS does not > support truncate (a standard Posix operation) which is a reverse operation of > append, which makes upper layer applications use ugly workarounds (such as > keeping track of the discarded byte range per file in a separate metadata > store, and periodically running a vacuum process to rewrite compacted files) > to overcome this limitation of HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177853#comment-14177853 ] Yongjun Zhang commented on HDFS-7235: - Hi [~cmccabe], Thanks again for the review. Please see my answers below. {quote} We shouldn't log a message saying that "the block file doesn't exist" if the block file exists, but is not finalized. {quote} We are not; we only log when the state is finalized and the block file doesn't exist. {quote} I also don't see why we need to call FSDatasetSpi#getLength, if we already have access to the replica length here. {quote} The new fix we are introducing here is to handle a special case where {{isValidBlock()}} returns false, so I tried to limit the change to the special handling block. If we remove the pre-existing {{FSDatasetSpi#getLength}}, we need to move the {{getReplica()}} call out of the false block. {{getReplica()}} was marked {{@Deprecated}}; I consider calling it a bit hacky here already. Plus, we need to synchronize the whole block of code, so I hope we can limit the impact to within the false block. I wonder if this explanation makes sense to you. {quote} I would suggest having your synchronized section set a string named replicaProblem. Then if the string is null at the end, there is no problem-- otherwise, the problem is contained in replicaProblem. Then you can check existence, replica state, and length all at once. {quote} I am not sure I follow what you said, will check in person. {quote} We don't even need to call isValidBlock. getReplica gives you all the info you need. Please take out this call, since it's unnecessary. {quote} {{isValidBlock}} is an interface method defined in FsDatasetSpi, with implementations in derived classes such as FsDatasetImpl and SimulatedFSDataset, which might implement it differently. It'd be nice to stick to the interface of FsDatasetSpi. Will discuss with you more. Thanks again. 
> Can not decommission DN which has invalid block due to bad disk > --- > > Key: HDFS-7235 > URL: https://issues.apache.org/jira/browse/HDFS-7235 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, > HDFS-7235.003.patch > > > When to decommission a DN, the process hangs. > What happens is, when NN chooses a replica as a source to replicate data on > the to-be-decommissioned DN to other DNs, it favors choosing this DN > to-be-decommissioned as the source of transfer (see BlockManager.java). > However, because of the bad disk, the DN would detect the source block to be > transfered as invalidBlock with the following logic in FsDatasetImpl.java: > {code} > /** Does the block exist and have the given state? */ > private boolean isValid(final ExtendedBlock b, final ReplicaState state) { > final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), > b.getLocalBlock()); > return replicaInfo != null > && replicaInfo.getState() == state > && replicaInfo.getBlockFile().exists(); > } > {code} > The reason that this method returns false (detecting invalid block) is > because the block file doesn't exist due to bad disk in this case. > The key issue we found here is, after DN detects an invalid block for the > above reason, it doesn't report the invalid block back to NN, thus NN doesn't > know that the block is corrupted, and keeps sending the data transfer request > to the same DN to be decommissioned, again and again. This caused an infinite > loop, so the decommission process hangs. > Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
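Colin's replicaProblem suggestion from the comment above can be sketched independently of the FsDataset internals: one synchronized pass checks existence, state, and length, and returns *why* the replica is unusable rather than a bare false. The Replica record and its field names here are illustrative stand-ins, not the actual HDFS types:

```java
public class ReplicaCheck {
    // Minimal stand-in for the replica metadata consulted under the dataset lock.
    static class Replica {
        final String state;       // e.g. "FINALIZED"
        final boolean fileExists; // does the block file exist on disk?
        final long length;
        Replica(String state, boolean fileExists, long length) {
            this.state = state; this.fileExists = fileExists; this.length = length;
        }
    }

    // Returns null when the replica is fine; otherwise a description of the
    // problem, suitable for reporting the bad block back to the NameNode.
    public static synchronized String replicaProblem(Replica r, long expectedLength) {
        if (r == null)                    return "replica not in volume map";
        if (!r.state.equals("FINALIZED")) return "replica not finalized: " + r.state;
        if (!r.fileExists)                return "block file does not exist";
        if (r.length < expectedLength)    return "replica shorter than expected";
        return null;
    }

    public static void main(String[] args) {
        // The bad-disk case from this issue: finalized state, missing file.
        System.out.println(replicaProblem(new Replica("FINALIZED", false, 100), 100));
    }
}
```

Reporting this string back to the NN, instead of silently failing the transfer, is what breaks the decommission loop described in the issue.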
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177828#comment-14177828 ] Hadoop QA commented on HDFS-7235: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675964/HDFS-7235.003.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1304 javac compiler warnings (more than the trunk's current 1293 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.datanode.TestRefreshNamenodes org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.namenode.ha.TestHAFsck org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.fs.TestSymlinkHdfsFileSystem {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8462//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8462//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8462//console This message is automatically generated. > Can not decommission DN which has invalid block due to bad disk > --- > > Key: HDFS-7235 > URL: https://issues.apache.org/jira/browse/HDFS-7235 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, > HDFS-7235.003.patch > > > When to decommission a DN, the process hangs. > What happens is, when NN chooses a replica as a source to replicate data on > the to-be-decommissioned DN to other DNs, it favors choosing this DN > to-be-decommissioned as the source of transfer (see BlockManager.java). > However, because of the bad disk, the DN would detect the source block to be > transfered as invalidBlock with the following logic in FsDatasetImpl.java: > {code} > /** Does the block exist and have the given state? */ > private boolean isValid(final ExtendedBlock b, final ReplicaState state) { > final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), > b.getLocalBlock()); > return replicaInfo != null > && replicaInfo.getState() == state > && replicaInfo.getBlockFile().exists(); > } > {code} > The reason that this method returns false (detecting invalid block) is > because the block file doesn't exist due to bad disk in this case. > The key issue we found here is, after DN detects an invalid block for the > above reason, it doesn't report the invalid block back to NN, thus NN doesn't > know that the block is corrupted, and keeps sending the data transfer request > to the same DN to be decommissioned, again and again. This caused an infinite > loop, so the decommission process hangs. 
> Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-5928) show namespace and namenode ID on NN dfshealth page
[ https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177829#comment-14177829 ] Hadoop QA commented on HDFS-5928: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675962/HDFS-5928.v4.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1304 javac compiler warnings (more than the trunk's current 1293 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.fs.TestSymlinkHdfsFileSystem {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8461//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8461//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8461//console This message is automatically generated. 
> show namespace and namenode ID on NN dfshealth page > --- > > Key: HDFS-5928 > URL: https://issues.apache.org/jira/browse/HDFS-5928 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, > HDFS-5928.v4.patch, HDFS-5928.v1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7180) NFSv3 gateway frequently gets stuck
[ https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177825#comment-14177825 ] Hadoop QA commented on HDFS-7180: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676001/HDFS-7180.001.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs-nfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8466//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8466//console This message is automatically generated. 
> NFSv3 gateway frequently gets stuck > --- > > Key: HDFS-7180 > URL: https://issues.apache.org/jira/browse/HDFS-7180 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.5.0 > Environment: Linux, Fedora 19 x86-64 >Reporter: Eric Zhiqiang Ma >Assignee: Brandon Li >Priority: Critical > Attachments: HDFS-7180.001.patch > > > We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway > on one node in the cluster to let users upload data with rsync. > However, we find the NFSv3 daemon seems frequently get stuck while the HDFS > seems working well. (hdfds dfs -ls and etc. works just well). The last stuck > we found is after around 1 day running and several hundreds GBs of data > uploaded. > The NFSv3 daemon is started on one node and on the same node the NFS is > mounted. > From the node where the NFS is mounted: > dmsg shows like this: > [1859245.368108] nfs: server localhost not responding, still trying > [1859245.368111] nfs: server localhost not responding, still trying > [1859245.368115] nfs: server localhost not responding, still trying > [1859245.368119] nfs: server localhost not responding, still trying > [1859245.368123] nfs: server localhost not responding, still trying > [1859245.368127] nfs: server localhost not responding, still trying > [1859245.368131] nfs: server localhost not responding, still trying > [1859245.368135] nfs: server localhost not responding, still trying > [1859245.368138] nfs: server localhost not responding, still trying > [1859245.368142] nfs: server localhost not responding, still trying > [1859245.368146] nfs: server localhost not responding, still trying > [1859245.368150] nfs: server localhost not responding, still trying > [1859245.368153] nfs: server localhost not responding, still trying > The mounted directory can not be `ls` and `df -hT` gets stuck too. 
> The latest lines from the nfs3 log in the hadoop logs directory: > 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > user map size: 35 > 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > group map size: 54 > 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:53:23,809 I
[jira] [Commented] (HDFS-7259) Unresponsive NFS mount point due to deferred COMMIT response
[ https://issues.apache.org/jira/browse/HDFS-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177821#comment-14177821 ] Jing Zhao commented on HDFS-7259: - Thanks for working on this, Brandon! The patch looks good to me. +1. > Unresponsive NFS mount point due to deferred COMMIT response > - > > Key: HDFS-7259 > URL: https://issues.apache.org/jira/browse/HDFS-7259 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HDFS-7259.001.patch, HDFS-7259.002.patch > > > Since the gateway can't commit random writes, it caches the COMMIT requests in > a queue and sends back a response only when the data can be committed or the > stream times out (failure in the latter case). This could cause problems in two patterns: > (1) file uploading failure > (2) the mount dir is stuck on the same client, but other NFS clients can > still access the NFS gateway. > The error pattern (2) arises because there are too many COMMIT requests pending, > so the NFS client can't send any other requests (e.g., for "ls") to the NFS > gateway within its pending-requests limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7154) Fix returning value of starting reconfiguration task
[ https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7154: --- Resolution: Fixed Fix Version/s: 2.6.0 Status: Resolved (was: Patch Available) Committed. Thanks, Eddy. > Fix returning value of starting reconfiguration task > > > Key: HDFS-7154 > URL: https://issues.apache.org/jira/browse/HDFS-7154 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 3.0.0, 2.6.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Fix For: 2.6.0 > > Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, > HDFS-7154.001.patch, HDFS-7154.001.patch > > > Running {{hdfs dfsadmin -reconfig ... start}} mistakenly returns {{-1}} > (255). This is because {{DFSAdmin#startReconfiguration()}} returns the wrong exit > code; it is expected to return 0 to indicate success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
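The fix described above boils down to the Unix exit-status convention: returning -1 from an admin subcommand surfaces to the shell as exit status 255. A minimal sketch of the corrected convention (hypothetical names, not the actual {{DFSAdmin}} code):

```java
// Hypothetical sketch of the exit-code convention HDFS-7154 fixes:
// a successful subcommand must return 0; returning -1 is reported
// by the shell as exit status 255, which reads as a failure.
class ReconfigExitCodeSketch {
    // Models the return value of "hdfs dfsadmin -reconfig ... start".
    static int startReconfiguration(boolean rpcSucceeded) {
        if (!rpcSucceeded) {
            return -1;  // error path: non-zero exit code (255 in the shell)
        }
        return 0;       // success must map to exit status 0, not -1
    }
}
```

The convention matters because scripts driving {{hdfs dfsadmin}} check `$?` and would treat 255 as a failed reconfiguration even when the task started correctly.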
[jira] [Commented] (HDFS-7154) Fix returning value of starting reconfiguration task
[ https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177810#comment-14177810 ] Hudson commented on HDFS-7154: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6296 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6296/]) HDFS-7154. Fix returning value of starting reconfiguration task (Lei Xu via Colin P. McCabe) (cmccabe: rev 7aab5fa1bd9386b036af45cd8206622a4555d74a) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSAdmin.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSAdmin.java > Fix returning value of starting reconfiguration task > > > Key: HDFS-7154 > URL: https://issues.apache.org/jira/browse/HDFS-7154 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 3.0.0, 2.6.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Fix For: 2.6.0 > > Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, > HDFS-7154.001.patch, HDFS-7154.001.patch > > > Running {{hdfs dfsadmin -reconfig ... start}} mistakenly returns {{-1}} > (255). This is because {{DFSAdmin#startReconfiguration()}} returns the wrong exit > code; it is expected to return 0 to indicate success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message
[ https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-3342: Attachment: HDFS-3342.002.patch The eclipse:eclipse build issue appears to be a glitch; uploading the same patch again to trigger another run. > SocketTimeoutException in BlockSender.sendChunks could have a better error > message > -- > > Key: HDFS-3342 > URL: https://issues.apache.org/jira/browse/HDFS-3342 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.0.0-alpha >Reporter: Todd Lipcon >Assignee: Yongjun Zhang >Priority: Minor > Labels: supportability > Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch, > HDFS-3342.002.patch > > > Currently, if a client connects to a DN and begins to read a block, but then > stops calling read() for a long period of time, the DN will log a > SocketTimeoutException "48 millis timeout while waiting for channel to be > ready for write." This is because there is no "keepalive" functionality of > any kind. At a minimum, we should improve this error message to be an INFO-level > log which just says that the client likely stopped reading, so we are > disconnecting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177795#comment-14177795 ] Hudson commented on HDFS-7266: -- FAILURE: Integrated in Hadoop-trunk-Commit #6295 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6295/]) HDFS-7266. HDFS Peercache enabled check should not lock on object (awang via cmccabe) (cmccabe: rev 4799570dfdb7987c2ac39716143341e9a3d9b7d2) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/PeerCache.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.0 >Reporter: Gopal V >Assignee: Andrew Wang >Priority: Minor > Labels: multi-threading > Fix For: 2.7.0 > > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the PeerCache, even when the peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is final, this check could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
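The change works because {{capacity}} is a final field, so reading it outside the monitor is safe under the Java memory model. A simplified sketch of the pattern (illustrative types and names, not the actual {{PeerCache}} class):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified sketch of the HDFS-7266 change: read the final `capacity`
// field without synchronization, and take the lock only when the cache
// is actually enabled. The real PeerCache keys on (DatanodeID, domain);
// a String stands in for a cached peer here.
class PeerCacheSketch {
    private final int capacity;                  // final: safe to read unlocked
    private final Deque<String> peers = new ArrayDeque<>();

    PeerCacheSketch(int capacity) { this.capacity = capacity; }

    String get(String dnId) {                    // dnId unused in this sketch
        if (capacity <= 0) {                     // disabled: no lock taken
            return null;
        }
        synchronized (this) {                    // enabled: guard the deque
            return peers.pollFirst();
        }
    }

    synchronized void put(String dnId) {
        if (capacity > 0) {
            peers.addLast(dnId);
        }
    }
}
```

With the check hoisted, a cluster running with the cache disabled never contends on the cache monitor during {{fs.open}}, which is exactly the multi-threaded regression the attached dfs-open-10-threads.png illustrates.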
[jira] [Commented] (HDFS-7254) Add documents for hot swap drive
[ https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177793#comment-14177793 ] Colin Patrick McCabe commented on HDFS-7254: +1. Thanks, Eddy. > Add documents for hot swap drive > > > Key: HDFS-7254 > URL: https://issues.apache.org/jira/browse/HDFS-7254 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 2.5.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch > > > Add documents for the hot swap drive functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177792#comment-14177792 ] Colin Patrick McCabe commented on HDFS-7235:
{code}
boolean needToReportBadBlock = false;
synchronized (data) {
  ReplicaInfo replicaInfo = (ReplicaInfo) data.getReplica(
      block.getBlockPoolId(), block.getBlockId());
  needToReportBadBlock = (replicaInfo != null
      && replicaInfo.getState() == ReplicaState.FINALIZED
      && !replicaInfo.getBlockFile().exists());
}
if (needToReportBadBlock) {
  // Report back to NN bad block caused by non-existent block file.
  reportBadBlock(bpos, block, "Can't replicate block " + block
      + " because the block file doesn't exist");
} else {
  String errStr = "Can't send invalid block " + block;
  LOG.info(errStr);
  bpos.trySendErrorReport(DatanodeProtocol.INVALID_BLOCK, errStr);
}
{code}
We shouldn't log a message saying that "the block file doesn't exist" if the block file exists but is not finalized. I also don't see why we need to call {{FSDatasetSpi#getLength}} if we already have access to the replica length here. I would suggest having your synchronized section set a string named {{replicaProblem}}. Then if the string is null at the end, there is no problem; otherwise, the problem is contained in {{replicaProblem}}. Then you can check existence, replica state, and length all at once. bq. BTW, about the WATCH-OUT, I was just thinking that someone could add another condition in the FsDatasetImpl#isValidBlock that makes the method return false. But that's remote and probably won't happen. We don't even need to call {{isValidBlock}}; {{getReplica}} gives you all the info you need. Please take out this call, since it's unnecessary. 
> Can not decommission DN which has invalid block due to bad disk > --- > > Key: HDFS-7235 > URL: https://issues.apache.org/jira/browse/HDFS-7235 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, > HDFS-7235.003.patch > > > When decommissioning a DN, the process hangs. > What happens is, when the NN chooses a replica as a source to replicate data on > the to-be-decommissioned DN to other DNs, it favors choosing this DN > to-be-decommissioned as the source of the transfer (see BlockManager.java). > However, because of the bad disk, the DN would detect the source block to be > transferred as an invalid block with the following logic in FsDatasetImpl.java: > {code} > /** Does the block exist and have the given state? */ > private boolean isValid(final ExtendedBlock b, final ReplicaState state) { > final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), > b.getLocalBlock()); > return replicaInfo != null > && replicaInfo.getState() == state > && replicaInfo.getBlockFile().exists(); > } > {code} > The reason that this method returns false (detecting an invalid block) is > that the block file doesn't exist due to the bad disk in this case. > The key issue we found here is, after the DN detects an invalid block for the > above reason, it doesn't report the invalid block back to the NN, so the NN doesn't > know that the block is corrupted, and keeps sending the data transfer request > to the same DN to be decommissioned, again and again. This causes an infinite > loop, so the decommission process hangs. > Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
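The {{replicaProblem}} refactoring suggested in the review comment above can be sketched as follows (hypothetical helper with simplified types; in the real patch the checks would run inside the dataset lock and use {{ReplicaInfo}} directly):

```java
// Sketch of the reviewer's suggestion for HDFS-7235: compute one
// diagnostic string inside the synchronized section; a null result
// means the replica is usable, otherwise the string names the problem.
class ReplicaProblemSketch {
    enum ReplicaState { FINALIZED, RBW, TEMPORARY }

    static String replicaProblem(ReplicaState state, boolean blockFileExists,
                                 long replicaLength, long expectedLength) {
        if (state != ReplicaState.FINALIZED) {
            return "replica is not finalized: " + state;
        }
        if (!blockFileExists) {
            return "block file doesn't exist";
        }
        if (replicaLength != expectedLength) {
            return "replica length " + replicaLength
                + " does not match expected length " + expectedLength;
        }
        return null;  // no problem found; the block can be replicated
    }
}
```

This shape lets the DN report the precise reason (state, existence, or length) back to the NN instead of logging a misleading "block file doesn't exist" message for every failure mode.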
[jira] [Commented] (HDFS-7269) NN and DN don't check whether corrupted blocks reported by clients are actually corrupted
[ https://issues.apache.org/jira/browse/HDFS-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177789#comment-14177789 ] Tsz Wo Nicholas Sze commented on HDFS-7269: --- By HDFS-1371, the client should not report a checksum failure when all the nodes are bad. Do the files have only one replica in your case? > NN and DN don't check whether corrupted blocks reported by clients are > actually corrupted > - > > Key: HDFS-7269 > URL: https://issues.apache.org/jira/browse/HDFS-7269 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had a case where the client machine had a memory issue and thus failed the > checksum validation of a given block for all its replicas. So the client > ended up informing the NN about the corrupted blocks for all DNs via > reportBadBlocks. However, the block isn't corrupted on any of the DNs. You > can still use DFSClient to read the block. But in order to get rid of the NN's > warning message for the corrupt block, we either do a NN failover, or repair the > file by a) copying the file somewhere, b) removing the file, c) copying the file > back. > It will be useful if the NN and DN can validate the client's report. In fact, there > is a comment in NamenodeRpcServer about this. > {noformat} > /** >* The client has detected an error on the specified located blocks >* and is reporting them to the server. For now, the namenode will >* mark the block as corrupt. In the future we might >* check the blocks are actually corrupt. >*/ > {noformat} > To allow the system to recover from an invalid client report quickly, we can support > automatic recovery or a manual admin command. > 1. We can have the NN send a new DatanodeCommand like ValidateBlockCommand. The DN > will notify the validation result via IBR and a new > ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK. > 2. A new admin command to move corrupted blocks out of BM's > CorruptReplicasMap and UnderReplicatedBlocks. > Appreciate any input. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7266: --- Resolution: Fixed Fix Version/s: 2.7.0 Target Version/s: 2.7.0 Status: Resolved (was: Patch Available) > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.0 >Reporter: Gopal V >Assignee: Andrew Wang >Priority: Minor > Labels: multi-threading > Fix For: 2.7.0 > > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the PeerCache, even when the peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is final, this check could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1418#comment-1418 ] Colin Patrick McCabe commented on HDFS-7266: +1. Test failures look like HDFS-7226, not related. No new tests are needed because this is a small change to locking which is covered by the previous PeerCache tests. Will commit momentarily. Thanks Andrew and Gopal! > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.0 >Reporter: Gopal V >Assignee: Andrew Wang >Priority: Minor > Labels: multi-threading > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the PeerCache, even when the peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is final, this check could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7215) Add JvmPauseMonitor to NFS gateway
[ https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1413#comment-1413 ] Colin Patrick McCabe commented on HDFS-7215: +1 for the current patch. Will commit tomorrow if nobody has any more comments. > Add JvmPauseMonitor to NFS gateway > -- > > Key: HDFS-7215 > URL: https://issues.apache.org/jira/browse/HDFS-7215 > Project: Hadoop HDFS > Issue Type: Improvement > Components: nfs >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li >Priority: Minor > Attachments: HDFS-7215.001.patch > > > Like NN/DN, a GC log would help debug issues in NFS gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7180) NFSv3 gateway frequently gets stuck
[ https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HDFS-7180: - Attachment: HDFS-7180.001.patch > NFSv3 gateway frequently gets stuck > --- > > Key: HDFS-7180 > URL: https://issues.apache.org/jira/browse/HDFS-7180 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.5.0 > Environment: Linux, Fedora 19 x86-64 >Reporter: Eric Zhiqiang Ma >Assignee: Brandon Li >Priority: Critical > Attachments: HDFS-7180.001.patch > > > We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway > on one node in the cluster to let users upload data with rsync. > However, we find the NFSv3 daemon seems frequently get stuck while the HDFS > seems working well. (hdfds dfs -ls and etc. works just well). The last stuck > we found is after around 1 day running and several hundreds GBs of data > uploaded. > The NFSv3 daemon is started on one node and on the same node the NFS is > mounted. 
> From the node where the NFS is mounted: > dmsg shows like this: > [1859245.368108] nfs: server localhost not responding, still trying > [1859245.368111] nfs: server localhost not responding, still trying > [1859245.368115] nfs: server localhost not responding, still trying > [1859245.368119] nfs: server localhost not responding, still trying > [1859245.368123] nfs: server localhost not responding, still trying > [1859245.368127] nfs: server localhost not responding, still trying > [1859245.368131] nfs: server localhost not responding, still trying > [1859245.368135] nfs: server localhost not responding, still trying > [1859245.368138] nfs: server localhost not responding, still trying > [1859245.368142] nfs: server localhost not responding, still trying > [1859245.368146] nfs: server localhost not responding, still trying > [1859245.368150] nfs: server localhost not responding, still trying > [1859245.368153] nfs: server localhost not responding, still trying > The mounted directory can not be `ls` and `df -hT` gets stuck too. 
> The latest lines from the nfs3 log in the hadoop logs directory: > 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > user map size: 35 > 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > group map size: 54 > 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update > cache now > 2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not > doing static UID/GID mapping because '/etc/nfs.map' does not exist. 
> 2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > user map size: 35 > 2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > group map size: 54 > 2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow > ReadProcessor read fields took 60062ms (threshold=3ms); ack: seqno: -2 > status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: > [10.0.3.172:50010, 10.0.3.176:50010] > 2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: > DFSOutputStream ResponseProcessor exception for block > BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 > java.io.IOException: Bad response ERROR for block > BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 fr
[jira] [Updated] (HDFS-7180) NFSv3 gateway frequently gets stuck
[ https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HDFS-7180: - Status: Patch Available (was: Open) > NFSv3 gateway frequently gets stuck > --- > > Key: HDFS-7180 > URL: https://issues.apache.org/jira/browse/HDFS-7180 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.5.0 > Environment: Linux, Fedora 19 x86-64 >Reporter: Eric Zhiqiang Ma >Assignee: Brandon Li >Priority: Critical > Attachments: HDFS-7180.001.patch > > > We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway > on one node in the cluster to let users upload data with rsync. > However, we find the NFSv3 daemon seems frequently get stuck while the HDFS > seems working well. (hdfds dfs -ls and etc. works just well). The last stuck > we found is after around 1 day running and several hundreds GBs of data > uploaded. > The NFSv3 daemon is started on one node and on the same node the NFS is > mounted. 
> From the node where the NFS is mounted: > dmsg shows like this: > [1859245.368108] nfs: server localhost not responding, still trying > [1859245.368111] nfs: server localhost not responding, still trying > [1859245.368115] nfs: server localhost not responding, still trying > [1859245.368119] nfs: server localhost not responding, still trying > [1859245.368123] nfs: server localhost not responding, still trying > [1859245.368127] nfs: server localhost not responding, still trying > [1859245.368131] nfs: server localhost not responding, still trying > [1859245.368135] nfs: server localhost not responding, still trying > [1859245.368138] nfs: server localhost not responding, still trying > [1859245.368142] nfs: server localhost not responding, still trying > [1859245.368146] nfs: server localhost not responding, still trying > [1859245.368150] nfs: server localhost not responding, still trying > [1859245.368153] nfs: server localhost not responding, still trying > The mounted directory can not be `ls` and `df -hT` gets stuck too. 
> The latest lines from the nfs3 log in the hadoop logs directory: > 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > user map size: 35 > 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > group map size: 54 > 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: > Have to change stable write to unstable write:FILE_SYNC > 2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update > cache now > 2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not > doing static UID/GID mapping because '/etc/nfs.map' does not exist. 
> 2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > user map size: 35 > 2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated > group map size: 54 > 2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow > ReadProcessor read fields took 60062ms (threshold=3ms); ack: seqno: -2 > status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: > [10.0.3.172:50010, 10.0.3.176:50010] > 2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: > DFSOutputStream ResponseProcessor exception for block > BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 > java.io.IOException: Bad response ERROR for block > BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_6236
[jira] [Resolved] (HDFS-5131) Need a DEFAULT-like pipeline recovery policy that works for writers that flush
[ https://issues.apache.org/jira/browse/HDFS-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze resolved HDFS-5131. --- Resolution: Duplicate Resolving this as a duplicate of HDFS-4257. > Need a DEFAULT-like pipeline recovery policy that works for writers that flush > -- > > Key: HDFS-5131 > URL: https://issues.apache.org/jira/browse/HDFS-5131 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.0.6-alpha >Reporter: Mike Percy >Assignee: Tsz Wo Nicholas Sze > > The Hadoop 2 pipeline-recovery mechanism currently has four policies: DISABLE > (never do recovery), NEVER (never do recovery unless client asks for it), > ALWAYS (block until we have recovered the write pipeline to minimum > replication levels), and DEFAULT (try to do ALWAYS, but use a heuristic to > "give up" and allow writers to continue if not enough datanodes are available > to recover the pipeline). > The big problem with default is that it specifically falls back to ALWAYS > behavior if a client calls hflush(). On its face, it seems like a reasonable > thing to do, but in practice this means that clients like Flume (as well as, > I assume, HBase) just block when the cluster is low on datanodes. > In order to work around this issue, the easiest thing to do today is set the > policy to NEVER when using Flume to write to the cluster. But obviously > that's not ideal. > I believe what clients like Flume need is an additional policy which > essentially uses the heuristic logic used by DEFAULT even in cases where > long-lived writers call hflush(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
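The four policies and the DEFAULT fallback described above can be sketched as a decision function (illustrative only; the real logic lives in {{ReplaceDatanodeOnFailure}} and also weighs the replication factor and remaining pipeline size, and NEVER can still recover when the client explicitly asks):

```java
// Illustrative sketch of the pipeline-recovery policies described in
// HDFS-5131. DEFAULT applies its "give up" heuristic only for writers
// that never called hflush(); flushing writers degrade to ALWAYS,
// which is the blocking behavior Flume/HBase-style clients hit.
class PipelinePolicySketch {
    enum Policy { DISABLE, NEVER, ALWAYS, DEFAULT }

    static boolean shouldBlockForRecovery(Policy p, boolean writerCalledHflush,
                                          boolean enoughDatanodesLeft) {
        switch (p) {
            case DISABLE:
            case NEVER:
                return false;   // never block the writer (NEVER's opt-in omitted)
            case ALWAYS:
                return true;    // always restore the pipeline, even if it blocks
            default:            // DEFAULT
                // Heuristic: recover when enough DNs remain, but a writer
                // that flushes is forced into ALWAYS-like blocking.
                return writerCalledHflush || enoughDatanodesLeft;
        }
    }
}
```

The policy this issue asks for would be the DEFAULT branch with the {{writerCalledHflush}} term removed, so long-lived flushing writers get the same give-up heuristic as everyone else.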
[jira] [Updated] (HDFS-7270) Implementing congestion control in writing pipeline
[ https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-7270: -- Component/s: datanode Issue Type: Improvement (was: Bug) > Implementing congestion control in writing pipeline > --- > > Key: HDFS-7270 > URL: https://issues.apache.org/jira/browse/HDFS-7270 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Haohui Mai >Assignee: Haohui Mai > > When a client writes to HDFS faster than the disk bandwidth of the DNs, it > saturates the disk bandwidth and makes the DNs unresponsive. The client only > backs off by aborting / recovering the pipeline, which leads to failed writes > and unnecessary pipeline recovery. > This jira proposes to add explicit congestion control mechanisms in the > writing pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
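One way to picture the proposed congestion control: a DN under disk pressure marks its acks as congested, and the writer backs off instead of aborting the pipeline. The sketch below is a hypothetical illustration (the method, constants, and backoff shape are invented; the JIRA does not specify a mechanism).

```java
// Hypothetical sketch of client-side backoff driven by a congestion flag in
// the pipeline ack, as an alternative to aborting/recovering the pipeline.
// All constants (50 ms initial, 5 s cap) are invented for illustration.
public class CongestionSketch {
    /** Milliseconds the writer should pause before sending the next packet. */
    static long backoffMillis(boolean ackCongested, long currentBackoff) {
        if (!ackCongested) {
            return 0;                       // pipeline healthy: full speed
        }
        long next = currentBackoff == 0 ? 50 : currentBackoff * 2;
        return Math.min(next, 5000);        // cap the exponential backoff
    }
}
```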
[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message
[ https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177735#comment-14177735 ] Hadoop QA commented on HDFS-3342: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675985/HDFS-3342.002.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8464//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8464//console This message is automatically generated. 
> SocketTimeoutException in BlockSender.sendChunks could have a better error > message > -- > > Key: HDFS-3342 > URL: https://issues.apache.org/jira/browse/HDFS-3342 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.0.0-alpha >Reporter: Todd Lipcon >Assignee: Yongjun Zhang >Priority: Minor > Labels: supportability > Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch > > > Currently, if a client connects to a DN and begins to read a block, but then > stops calling read() for a long period of time, the DN will log a > SocketTimeoutException "48 millis timeout while waiting for channel to be > ready for write." This is because there is no "keepalive" functionality of > any kind. At a minimum, we should improve this error message to be an INFO > level log which just says that the client likely stopped reading, so > disconnecting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7154) Fix returning value of starting reconfiguration task
[ https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177728#comment-14177728 ] Hadoop QA commented on HDFS-7154: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672829/HDFS-7154.001.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8460//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8460//console This message is automatically generated. > Fix returning value of starting reconfiguration task > > > Key: HDFS-7154 > URL: https://issues.apache.org/jira/browse/HDFS-7154 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 3.0.0, 2.6.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, > HDFS-7154.001.patch, HDFS-7154.001.patch > > > Running {{hdfs dfsadmin -reconfig ... 
start}} mistakenly returns {{-1}} > (255). This is because {{DFSAdmin#startReconfiguration()}} returns the wrong > exit code; it is expected to return 0 to indicate success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
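The symptom above (-1 showing up as 255) follows from the shell's exit-status convention: a process exit code is reported modulo 256. A minimal sketch of the bug and the fix, with invented class and method names:

```java
// Hypothetical sketch of the exit-code bug described above: the command
// returned -1 even on success, and shells report exit status as an 8-bit
// value, so callers observed 255 and treated a successful start as failure.
public class ExitCodeSketch {
    /** What a shell observes for a given Java-side exit code. */
    static int toShellStatus(int exitCode) {
        return exitCode & 0xFF;
    }

    /** Fixed behavior: 0 on success, non-zero only on actual failure. */
    static int startReconfiguration(boolean rpcSucceeded) {
        return rpcSucceeded ? 0 : -1;
    }
}
```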
[jira] [Updated] (HDFS-7254) Add documents for hot swap drive
[ https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-7254: Attachment: HDFS-7254.001.patch [~cmccabe] Thanks for your review. I have made the changes accordingly. Could you take another look at the patch? > Add documents for hot swap drive > > > Key: HDFS-7254 > URL: https://issues.apache.org/jira/browse/HDFS-7254 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 2.5.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch > > > Add documents for the hot swap drive functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message
[ https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177704#comment-14177704 ] Yongjun Zhang commented on HDFS-3342: - Hi [~andrew.wang], Thanks a lot for the review and comments! Good catch. Indeed, if the user sets the log level to WARN, the new message I added won't be seen. The WARN message was there before I made this change, and it's intended to report the stack trace for all IOExceptions. The new message I added tries to say "Likely the client has stopped reading..". There may be other causes of SocketTimeoutException than the one we are dealing with here, and I was worried that taking out the WARN message would cause those other cases to go unreported; that's why I used the word "Likely". To address your comment, I added a similar statement to the WARN message and uploaded a new rev (002), so a similar message will be printed at the WARN log level. I wonder whether it looks good to you. Thanks again. > SocketTimeoutException in BlockSender.sendChunks could have a better error > message > -- > > Key: HDFS-3342 > URL: https://issues.apache.org/jira/browse/HDFS-3342 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.0.0-alpha >Reporter: Todd Lipcon >Assignee: Yongjun Zhang >Priority: Minor > Labels: supportability > Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch > > > Currently, if a client connects to a DN and begins to read a block, but then > stops calling read() for a long period of time, the DN will log a > SocketTimeoutException "48 millis timeout while waiting for channel to be > ready for write." This is because there is no "keepalive" functionality of > any kind. At a minimum, we should improve this error message to be an INFO > level log which just says that the client likely stopped reading, so > disconnecting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message
[ https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-3342: Attachment: HDFS-3342.002.patch > SocketTimeoutException in BlockSender.sendChunks could have a better error > message > -- > > Key: HDFS-3342 > URL: https://issues.apache.org/jira/browse/HDFS-3342 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.0.0-alpha >Reporter: Todd Lipcon >Assignee: Yongjun Zhang >Priority: Minor > Labels: supportability > Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch > > > Currently, if a client connects to a DN and begins to read a block, but then > stops calling read() for a long period of time, the DN will log a > SocketTimeoutException "48 millis timeout while waiting for channel to be > ready for write." This is because there is no "keepalive" functionality of > any kind. At a minimum, we should improve this error message to be an INFO > level log which just says that the client likely stopped reading, so > disconnecting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7254) Add documents for hot swap drive
[ https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177691#comment-14177691 ] Colin Patrick McCabe commented on HDFS-7254: {code} DataNode supports hot swappable drives. The user can add or replace HDFS data {code} Should be "the Datanode" {code} * The user installs the new hard drives, formats them and mounts them appropriately. Optional. {code} This seems a bit confusing. Surely formatting and mounting appropriately is not optional? Maybe this should be described as "If there are new storage directories, the user should format them and mount them appropriately." The rest looks good. > Add documents for hot swap drive > > > Key: HDFS-7254 > URL: https://issues.apache.org/jira/browse/HDFS-7254 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 2.5.1 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-7254.000.patch > > > Add documents for the hot swap drive functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7257) Add the time of last HA state transition to NN's /jmx page
[ https://issues.apache.org/jira/browse/HDFS-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177679#comment-14177679 ] Hadoop QA commented on HDFS-7257: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675942/HDFS-7257.002.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8459//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8459//console This message is automatically generated. 
> Add the time of last HA state transition to NN's /jmx page > -- > > Key: HDFS-7257 > URL: https://issues.apache.org/jira/browse/HDFS-7257 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7257.001.patch, HDFS-7257.002.patch > > > It would be useful to some monitoring apps to expose the last HA transition > time in the NN's /jmx page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7259) Unresponsive NFS mount point due to deferred COMMIT response
[ https://issues.apache.org/jira/browse/HDFS-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177669#comment-14177669 ] Hadoop QA commented on HDFS-7259: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675970/HDFS-7259.002.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-nfs hadoop-hdfs-project/hadoop-hdfs-nfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8463//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8463//console This message is automatically generated. 
> Unresponsive NFS mount point due to deferred COMMIT response > - > > Key: HDFS-7259 > URL: https://issues.apache.org/jira/browse/HDFS-7259 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HDFS-7259.001.patch, HDFS-7259.002.patch > > > Since the gateway can't commit random writes, it caches the COMMIT requests in > a queue and sends back a response only when the data can be committed or the > stream times out (a failure in the latter case). This can cause problems in two > patterns: > (1) file uploading failure > (2) the mount dir is stuck on the same client, but other NFS clients can > still access the NFS gateway. > Error pattern (2) occurs because there are too many COMMIT requests pending, > so the NFS client can't send any other requests (e.g., for "ls") to the NFS > gateway within its pending-requests limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7259) Unresponsive NFS mount point due to deferred COMMIT response
[ https://issues.apache.org/jira/browse/HDFS-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HDFS-7259: - Attachment: HDFS-7259.002.patch Uploaded a new patch to fix the unit tests. > Unresponsive NFS mount point due to deferred COMMIT response > - > > Key: HDFS-7259 > URL: https://issues.apache.org/jira/browse/HDFS-7259 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HDFS-7259.001.patch, HDFS-7259.002.patch > > > Since the gateway can't commit random writes, it caches the COMMIT requests in > a queue and sends back a response only when the data can be committed or the > stream times out (a failure in the latter case). This can cause problems in two > patterns: > (1) file uploading failure > (2) the mount dir is stuck on the same client, but other NFS clients can > still access the NFS gateway. > Error pattern (2) occurs because there are too many COMMIT requests pending, > so the NFS client can't send any other requests (e.g., for "ls") to the NFS > gateway within its pending-requests limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
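The deferred-COMMIT behavior described in this thread can be sketched as a small queue with a cap. This is a hypothetical simplification (class, field, and the MAX_PENDING value are invented), not the gateway's actual code: COMMITs at or below the flushed offset succeed immediately, later ones are deferred, and capping the queue keeps the client's pending-request window from filling up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of deferred-COMMIT bookkeeping in an NFS gateway that
// cannot commit random writes. MAX_PENDING is an invented illustrative cap.
public class CommitQueueSketch {
    static final int MAX_PENDING = 2;
    final Deque<Long> pendingCommits = new ArrayDeque<>();
    long flushedOffset = 0;

    /** Returns "OK", "DEFERRED", or "REJECTED" for a COMMIT up to commitOffset. */
    String handleCommit(long commitOffset) {
        if (commitOffset <= flushedOffset) {
            return "OK";               // data already on stable storage
        }
        if (pendingCommits.size() >= MAX_PENDING) {
            return "REJECTED";         // don't exhaust the client's request slots
        }
        pendingCommits.add(commitOffset);
        return "DEFERRED";             // answered later, when data is flushed
    }
}
```

Without the cap, every deferred COMMIT holds one of the NFS client's limited in-flight request slots, which is how the mount point ends up unable to issue even an "ls".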
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177618#comment-14177618 ] Yongjun Zhang commented on HDFS-7235: - Hi [~cmccabe], Thanks for the review! I just uploaded rev 003 to address all the comments. BTW, about the WATCH-OUT, I was just thinking that someone could add another condition in {{FsDatasetImpl#isValidBlock}} that makes the method return false. But that's a remote possibility and probably won't happen. Thanks again. > Can not decommission DN which has invalid block due to bad disk > --- > > Key: HDFS-7235 > URL: https://issues.apache.org/jira/browse/HDFS-7235 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, > HDFS-7235.003.patch > > > When decommissioning a DN, the process hangs. > What happens is, when the NN chooses a replica as a source to replicate data on > the to-be-decommissioned DN to other DNs, it favors choosing this DN > to-be-decommissioned as the source of transfer (see BlockManager.java). > However, because of the bad disk, the DN would detect the source block to be > transferred as an invalid block with the following logic in FsDatasetImpl.java: > {code} > /** Does the block exist and have the given state? */ > private boolean isValid(final ExtendedBlock b, final ReplicaState state) { > final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), > b.getLocalBlock()); > return replicaInfo != null > && replicaInfo.getState() == state > && replicaInfo.getBlockFile().exists(); > } > {code} > The reason that this method returns false (detecting an invalid block) is > that the block file doesn't exist, due to the bad disk in this case.
> The key issue we found here is, after DN detects an invalid block for the > above reason, it doesn't report the invalid block back to NN, thus NN doesn't > know that the block is corrupted, and keeps sending the data transfer request > to the same DN to be decommissioned, again and again. This caused an infinite > loop, so the decommission process hangs. > Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
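The infinite loop described above breaks once the DN reports the unreadable replica back to the NN instead of silently failing the transfer. A minimal sketch of that idea, with invented class, field, and method names (the real patch works through the DN's transfer path and FsDatasetImpl, not through a class like this):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the decommission fix discussed above: when the DN
// finds a replica whose block file is missing, it reports the block as bad,
// so the NN stops re-selecting the same DN as the replication source.
public class BadBlockReportSketch {
    final Set<Long> missingBlockFiles = new HashSet<>();  // simulated bad disk
    final Set<Long> blocksReportedBad = new HashSet<>();  // what we told the NN

    /** Returns true if the transfer can proceed; reports the block otherwise. */
    boolean prepareTransfer(long blockId) {
        if (missingBlockFiles.contains(blockId)) {
            blocksReportedBad.add(blockId);  // tell the NN, breaking the retry loop
            return false;
        }
        return true;
    }
}
```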
[jira] [Commented] (HDFS-5928) show namespace and namenode ID on NN dfshealth page
[ https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177607#comment-14177607 ] Siqi Li commented on HDFS-5928: --- I have added the check for both namespace and namenodeID > show namespace and namenode ID on NN dfshealth page > --- > > Key: HDFS-5928 > URL: https://issues.apache.org/jira/browse/HDFS-5928 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, > HDFS-5928.v4.patch, HDFS-5928.v1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-7235: Attachment: HDFS-7235.003.patch > Can not decommission DN which has invalid block due to bad disk > --- > > Key: HDFS-7235 > URL: https://issues.apache.org/jira/browse/HDFS-7235 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, > HDFS-7235.003.patch > > > When decommissioning a DN, the process hangs. > What happens is, when the NN chooses a replica as a source to replicate data on > the to-be-decommissioned DN to other DNs, it favors choosing this DN > to-be-decommissioned as the source of transfer (see BlockManager.java). > However, because of the bad disk, the DN would detect the source block to be > transferred as an invalid block with the following logic in FsDatasetImpl.java: > {code} > /** Does the block exist and have the given state? */ > private boolean isValid(final ExtendedBlock b, final ReplicaState state) { > final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), > b.getLocalBlock()); > return replicaInfo != null > && replicaInfo.getState() == state > && replicaInfo.getBlockFile().exists(); > } > {code} > The reason that this method returns false (detecting an invalid block) is > that the block file doesn't exist, due to the bad disk in this case. > The key issue we found here is, after the DN detects an invalid block for the > above reason, it doesn't report the invalid block back to the NN, so the NN doesn't > know that the block is corrupted, and keeps sending the data transfer request > to the same DN to be decommissioned, again and again. This causes an infinite > loop, so the decommission process hangs. > Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-5928) show namespace and namenode ID on NN dfshealth page
[ https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated HDFS-5928: -- Attachment: HDFS-5928.v4.patch > show namespace and namenode ID on NN dfshealth page > --- > > Key: HDFS-5928 > URL: https://issues.apache.org/jira/browse/HDFS-5928 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, > HDFS-5928.v4.patch, HDFS-5928.v1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177600#comment-14177600 ] Hadoop QA commented on HDFS-7266: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675856/hdfs-7266.001.patch against trunk revision 8942741. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8458//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8458//console This message is automatically generated. 
> HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.0 >Reporter: Gopal V >Assignee: Andrew Wang >Priority: Minor > Labels: multi-threading > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is a final, this could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
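Since `capacity` is final (set once in the constructor), the disabled check can be hoisted out of the synchronized region, as the report suggests. A minimal sketch of that restructuring, with invented names and a stubbed-out lookup; the real class is `PeerCache` in the HDFS client:

```java
// Hypothetical sketch of the fix proposed above: because `capacity` is final,
// the disabled check needs no lock, so threads opening files stop serializing
// on the cache when it is turned off. `lookup` is a stand-in for the real
// cache lookup.
public class PeerCacheSketch {
    private final int capacity;

    PeerCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    Object get(String dnId) {
        if (capacity <= 0) {
            return null;              // cache disabled: no lock taken at all
        }
        synchronized (this) {
            return lookup(dnId);      // only contended when the cache is live
        }
    }

    private Object lookup(String dnId) {
        return "cached-peer-stub";    // placeholder for a real cache lookup
    }
}
```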
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177582#comment-14177582 ] Hadoop QA commented on HDFS-7221: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675918/HDFS-7221.005.patch against trunk revision 8942741. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8457//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8457//console This message is automatically generated. 
> TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177575#comment-14177575 ] Charles Lamb commented on HDFS-7221: I verified that the three test failures are unrelated. TestDNFencing (with and without replication) are known failures right now. TestDecommission passes on my local machine with the patch applied. > TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7215) Add JvmPauseMonitor to NFS gateway
[ https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177541#comment-14177541 ] Brandon Li commented on HDFS-7215: -- Thanks, Colin. I've filed HADOOP-11214 to track the effort of adding a web UI and other metrics information. Depending on how much we want to expose in the web UI, HADOOP-11214 might become an umbrella JIRA. We will see. > Add JvmPauseMonitor to NFS gateway > -- > > Key: HDFS-7215 > URL: https://issues.apache.org/jira/browse/HDFS-7215 > Project: Hadoop HDFS > Issue Type: Improvement > Components: nfs >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li >Priority: Minor > Attachments: HDFS-7215.001.patch > > > Like NN/DN, a GC log would help debug issues in the NFS gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7270) Implementing congestion control in writing pipeline
[ https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177543#comment-14177543 ] Haohui Mai commented on HDFS-7270: -- The point of this jira is to make the pipeline more stable and reduce unnecessary aborts / recovery. An alternative approach is to implement admission control: HDFS-7265 proposes to introduce a throttler to limit the amount of data that is written into HDFS. Deriving the right configuration for the throttler to balance the stability and throughput of the pipeline, however, is difficult in practice. The load on a cluster varies over time, and DNs can go up and down, which can make a static configuration suboptimal and thus defeat its purpose. > Implementing congestion control in writing pipeline > --- > > Key: HDFS-7270 > URL: https://issues.apache.org/jira/browse/HDFS-7270 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Haohui Mai >Assignee: Haohui Mai > > When a client writes to HDFS faster than the disk bandwidth of the DNs, it > saturates the disk bandwidth and makes the DNs unresponsive. The client only > backs off by aborting / recovering the pipeline, which leads to failed writes > and unnecessary pipeline recovery. > This jira proposes to add explicit congestion control mechanisms in the > writing pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7270) Implementing congestion control in writing pipeline
Haohui Mai created HDFS-7270: Summary: Implementing congestion control in writing pipeline Key: HDFS-7270 URL: https://issues.apache.org/jira/browse/HDFS-7270 Project: Hadoop HDFS Issue Type: Bug Reporter: Haohui Mai Assignee: Haohui Mai When a client writes to HDFS faster than the disk bandwidth of the DNs, it saturates the disk bandwidth and makes the DNs unresponsive. The client only backs off by aborting / recovering the pipeline, which leads to failed writes and unnecessary pipeline recovery. This jira proposes to add explicit congestion control mechanisms in the writing pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7269) NN and DN don't check whether corrupted blocks reported by clients are actually corrupted
Ming Ma created HDFS-7269: - Summary: NN and DN don't check whether corrupted blocks reported by clients are actually corrupted Key: HDFS-7269 URL: https://issues.apache.org/jira/browse/HDFS-7269 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma We had a case where the client machine had a memory issue and thus failed the checksum validation of a given block for all its replicas. So the client ended up informing the NN about the corrupted blocks for all DNs via reportBadBlocks. However, the block isn't corrupted on any of the DNs; you can still use DFSClient to read the block. But in order to get rid of the NN's warning message for the corrupt block, we either do an NN failover, or repair the file by a) copying the file somewhere, b) removing the file, and c) copying the file back. It would be useful if the NN and DN could validate the client's report. In fact, there is a comment in NamenodeRpcServer about this. {noformat} /** * The client has detected an error on the specified located blocks * and is reporting them to the server. For now, the namenode will * mark the block as corrupt. In the future we might * check the blocks are actually corrupt. */ {noformat} To allow the system to recover quickly from an invalid client report, we can support automatic recovery or a manual admin command. 1. We can have the NN send a new DatanodeCommand like ValidateBlockCommand. The DN will notify the validation result via IBR and a new ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK. 2. Add new admin commands to move corrupted blocks out of the BM's CorruptReplicasMap and UnderReplicatedBlocks. Appreciate any input. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7154) Fix returning value of starting reconfiguration task
[ https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177528#comment-14177528 ] Colin Patrick McCabe commented on HDFS-7154: +1. I am going to re-run Jenkins to get something which looks a little nicer. > Fix returning value of starting reconfiguration task > > > Key: HDFS-7154 > URL: https://issues.apache.org/jira/browse/HDFS-7154 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode >Affects Versions: 3.0.0, 2.6.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, > HDFS-7154.001.patch, HDFS-7154.001.patch > > > Running {{hdfs dfsadmin -reconfig ... start}} mistakenly returns {{-1}} > (255). It is due to {{DFSAdmin#startReconfiguration()}} returning the wrong > exit code. It is expected to return 0 to indicate success. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
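The fix boils down to making the success path map to exit code 0. A minimal sketch of the convention -- the method name mirrors DFSAdmin#startReconfiguration, but the body is illustrative, not the actual patch:

```java
// Sketch of the exit-code convention the patch restores: admin commands
// return 0 on success and non-zero on failure. The body is illustrative,
// not the real DFSAdmin code.
public class ReconfigExitCode {
    static int startReconfiguration(String nodeType, String address) {
        try {
            // ... issue the RPC that starts the reconfiguration task ...
            return 0;   // success must map to exit code 0
        } catch (RuntimeException e) {
            System.err.println("Failed to start reconfiguration: " + e);
            return -1;  // a shell observes this as 255
        }
    }
}
```

A shell treats any non-zero status as failure, and the process exit status is truncated to an unsigned byte, which is why the returned -1 shows up as 255.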
[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk
[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177518#comment-14177518 ] Colin Patrick McCabe commented on HDFS-7235: {code} ReplicaInfo replicaInfo = null; synchronized(data) { replicaInfo = (ReplicaInfo) data.getReplica(block.getBlockPoolId(), block.getBlockId()); } if (replicaInfo != null && replicaInfo.getState() == ReplicaState.FINALIZED && !replicaInfo.getBlockFile().exists()) { {code} You can't release the lock this way. Once you release the lock, replicaInfo could be mutated at any time. So you need to do all the checks under the lock. {code} // // Report back to NN bad block caused by non-existent block file. // WATCH-OUT: be sure the conditions checked above matches the following // method in FsDatasetImpl.java: // boolean isValidBlock(ExtendedBlock b) // all other conditions need to be true except that // replicaInfo.getBlockFile().exists() returns false. // {code} I don't think we need the "WATCH-OUT" part. We're not calling {{isValidBlock}}, so why do we care if the check is the same as that check? I generally agree with this approach and I think we can get this in if that's fixed. > Can not decommission DN which has invalid block due to bad disk > --- > > Key: HDFS-7235 > URL: https://issues.apache.org/jira/browse/HDFS-7235 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch > > > When decommissioning a DN, the process hangs. > What happens is, when NN chooses a replica as a source to replicate data on > the to-be-decommissioned DN to other DNs, it favors choosing this DN > to-be-decommissioned as the source of transfer (see BlockManager.java). 
> However, because of the bad disk, the DN would detect the source block to be > transferred as invalidBlock with the following logic in FsDatasetImpl.java: > {code} > /** Does the block exist and have the given state? */ > private boolean isValid(final ExtendedBlock b, final ReplicaState state) { > final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), > b.getLocalBlock()); > return replicaInfo != null > && replicaInfo.getState() == state > && replicaInfo.getBlockFile().exists(); > } > {code} > The reason that this method returns false (detecting an invalid block) is > that the block file doesn't exist due to the bad disk in this case. > The key issue we found here is, after the DN detects an invalid block for the > above reason, it doesn't report the invalid block back to NN, thus NN doesn't > know that the block is corrupted, and keeps sending the data transfer request > to the same DN to be decommissioned, again and again. This causes an infinite > loop, so the decommission process hangs. > Thanks [~qwertymaniac] for reporting the issue and initial analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
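The locking fix Colin asks for -- evaluating every condition inside one critical section so the replica cannot change between the lookup and the checks -- can be sketched like this (Replica and the lock object are simplified stand-ins, not the actual FsDatasetImpl code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the review feedback: do the replica lookup AND all state
// checks under a single lock acquisition. Replica is a simplified
// stand-in for ReplicaInfo.
class Replica {
    enum State { FINALIZED, RBW }
    final State state;
    final boolean fileExists;
    Replica(State state, boolean fileExists) {
        this.state = state;
        this.fileExists = fileExists;
    }
}

class ReplicaChecker {
    private final Object lock = new Object();   // stands in for "data"
    private final Map<Long, Replica> volumeMap = new HashMap<>();

    void put(long blockId, Replica r) {
        synchronized (lock) { volumeMap.put(blockId, r); }
    }

    /** True when the replica is FINALIZED but its block file is gone. */
    boolean isFinalizedButFileMissing(long blockId) {
        synchronized (lock) {   // every condition evaluated under the lock
            Replica r = volumeMap.get(blockId);
            return r != null
                && r.state == Replica.State.FINALIZED
                && !r.fileExists;
        }
    }
}
```

Releasing the lock between the lookup and the checks, as in the quoted patch hunk, would let another thread mutate or remove the replica in the window between the two.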
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177513#comment-14177513 ] Hadoop QA commented on HDFS-7221: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675887/HDFS-7221.004.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA org.apache.hadoop.hdfs.TestDecommission org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8454//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8454//console This message is automatically generated. 
> TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-5928) show namespace and namenode ID on NN dfshealth page
[ https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177508#comment-14177508 ] Haohui Mai commented on HDFS-5928: -- It seems that the page might not look right on a non-HA cluster, thus it requires a check to disable the output for non-HA clusters. > show namespace and namenode ID on NN dfshealth page > --- > > Key: HDFS-5928 > URL: https://issues.apache.org/jira/browse/HDFS-5928 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, > HDFS-5928.v1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177483#comment-14177483 ] Haohui Mai commented on HDFS-6744: -- I think it might be better to load all the information in the browser, since we have to load all the information anyway. We can populate the information into the DOM when it is requested -- pagination and sorting can be implemented in the same way. > Improve decommissioning nodes and dead nodes access on the new NN webUI > --- > > Key: HDFS-6744 > URL: https://issues.apache.org/jira/browse/HDFS-6744 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Siqi Li > Attachments: HDFS-6744.v1.patch, deadnodespage.png, > decomnodespage.png, livendoespage.png > > > The new NN webUI lists live nodes at the top of the page, followed by dead > nodes and decommissioning nodes. From the admins' point of view: > 1. Decommissioning nodes and dead nodes are more interesting. It is better to > move decommissioning nodes to the top of the page, followed by dead nodes and > live nodes. > 2. To find decommissioning nodes or dead nodes, the whole page that includes > all nodes needs to be loaded. That could take some time for big clusters. > The legacy web UI filters out the type of nodes dynamically. That seems to > work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7257) Add the time of last HA state transition to NN's /jmx page
[ https://issues.apache.org/jira/browse/HDFS-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7257: --- Attachment: HDFS-7257.002.patch The test failures in the jenkins run were unrelated. TestBalancer passes on my local machine with the patch applied. The .002 patch moves the test to a more appropriate Test...java file. > Add the time of last HA state transition to NN's /jmx page > -- > > Key: HDFS-7257 > URL: https://issues.apache.org/jira/browse/HDFS-7257 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7257.001.patch, HDFS-7257.002.patch > > > It would be useful to some monitoring apps to expose the last HA transition > time in the NN's /jmx page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7207) Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7207: --- Description: We should consider adding a C\+\+ interface for libhdfs, libhdfs3, and libwebhdfs. This interface should not impose unreasonable compatibility constraints on the libraries, and should be general enough to be useful for many C\+\+ projects. We may also want to avoid exceptions because some C\+\+ clients do not use them. (was: There are three major disadvantages of exposing exceptions in the public API: * Exposing exceptions in public APIs forces the downstream users to be compiled with {{-fexceptions}}, which might be infeasible in many use cases. * It forces other bindings to properly handle all C++ exceptions, which might be infeasible especially when the binding is generated by tools like SWIG. * It forces the downstream users to properly handle all C++ exceptions, which can be cumbersome as in certain cases it will lead to undefined behavior (e.g., throwing an exception in a destructor is undefined.) ) Priority: Major (was: Blocker) Issue Type: Improvement (was: Bug) > Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs > --- > > Key: HDFS-7207 > URL: https://issues.apache.org/jira/browse/HDFS-7207 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Colin Patrick McCabe > Attachments: HDFS-7207.001.patch > > > We should consider adding a C\+\+ interface for libhdfs, libhdfs3, and > libwebhdfs. This interface should not impose unreasonable compatibility > constraints on the libraries, and should be general enough to be useful for > many C\+\+ projects. We may also want to avoid exceptions because some C\+\+ > clients do not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7207) Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7207: --- Issue Type: Bug (was: Sub-task) Parent: (was: HDFS-6994) > Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs > --- > > Key: HDFS-7207 > URL: https://issues.apache.org/jira/browse/HDFS-7207 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Haohui Mai >Assignee: Colin Patrick McCabe >Priority: Blocker > Attachments: HDFS-7207.001.patch > > > There are three major disadvantages of exposing exceptions in the public API: > * Exposing exceptions in public APIs forces the downstream users to be > compiled with {{-fexceptions}}, which might be infeasible in many use cases. > * It forces other bindings to properly handle all C++ exceptions, which might > be infeasible especially when the binding is generated by tools like SWIG. > * It forces the downstream users to properly handle all C++ exceptions, which > can be cumbersome as in certain cases it will lead to undefined behavior > (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177426#comment-14177426 ] Colin Patrick McCabe commented on HDFS-7207: bq. As you mentioned, exposing \[shared_ptr\] might force the users to run tools like valgrind to detect leaks. It is impractical to use valgrind in many real-world use cases – valgrind can easily slow the program down by 20x. See http://groups.csail.mit.edu/commit/papers/2011/bruening-cgo11-drmemory.pdf I believe that using {{shared_ptr}} can reduce the frequency of memory leaks in many scenarios, such as this one. Avoiding memory leaks is one reason to use {{shared_ptr}}, in fact. Please do not forget that the C interface can generate memory leaks as well. bq. Though I prefer having a native C\+\+ interface, for the first cut I think it is fine to implement it using the C interface and to declare the interface as unstable. On the other hand, however, I think we also need to clean up the interface a little bit to make it more usable for C++ users. I agree. Let's move this JIRA out of the HDFS-6994 branch and consider it later. Adding a new API requires a lot of discussion and care, and should be done for all our interface libraries, not just for one. We should focus HDFS-6994 on getting libhdfs3 into a usable state. > libhdfs3 should not expose exceptions in public C++ API > --- > > Key: HDFS-7207 > URL: https://issues.apache.org/jira/browse/HDFS-7207 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Colin Patrick McCabe >Priority: Blocker > Attachments: HDFS-7207.001.patch > > > There are three major disadvantages of exposing exceptions in the public API: > * Exposing exceptions in public APIs forces the downstream users to be > compiled with {{-fexceptions}}, which might be infeasible in many use cases. 
> * It forces other bindings to properly handle all C++ exceptions, which might > be infeasible especially when the binding is generated by tools like SWIG. > * It forces the downstream users to properly handle all C++ exceptions, which > can be cumbersome as in certain cases it will lead to undefined behavior > (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7207) Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7207: --- Summary: Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs (was: libhdfs3 should not expose exceptions in public C++ API) > Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs > --- > > Key: HDFS-7207 > URL: https://issues.apache.org/jira/browse/HDFS-7207 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Colin Patrick McCabe >Priority: Blocker > Attachments: HDFS-7207.001.patch > > > There are three major disadvantages of exposing exceptions in the public API: > * Exposing exceptions in public APIs forces the downstream users to be > compiled with {{-fexceptions}}, which might be infeasible in many use cases. > * It forces other bindings to properly handle all C++ exceptions, which might > be infeasible especially when the binding is generated by tools like SWIG. > * It forces the downstream users to properly handle all C++ exceptions, which > can be cumbersome as in certain cases it will lead to undefined behavior > (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177427#comment-14177427 ] Yongjun Zhang commented on HDFS-7221: - Thanks Charles and Ming, the latest patch LGTM too. > TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7227) Fix findbugs warning about NP_DEREFERENCE_OF_READLINE_VALUE in SpanReceiverHost
[ https://issues.apache.org/jira/browse/HDFS-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177417#comment-14177417 ] Colin Patrick McCabe commented on HDFS-7227: bq. Tsuyoshi wrote: Hi Colin Patrick McCabe, Java coding style says that we should avoid omitting braces: Right. That's why I commented that "I thought there was some text in there about short "if" statements being OK to do on one line, but I don't see it in the guide." bq. stack wrote: Patch LGTM +1. Can I get another +1 on this? Since we're being pedantic :) It's clear that the findbugs warning in AbstractDelegationTokenSecretManager is not related, since this patch doesn't change that. > Fix findbugs warning about NP_DEREFERENCE_OF_READLINE_VALUE in > SpanReceiverHost > --- > > Key: HDFS-7227 > URL: https://issues.apache.org/jira/browse/HDFS-7227 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe >Priority: Minor > Attachments: HDFS-7227.001.patch, HDFS-7227.002.patch > > > Fix findbugs warning about NP_DEREFERENCE_OF_READLINE_VALUE in > SpanReceiverHost -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7184) Allow data migration tool to run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177416#comment-14177416 ] Tsz Wo Nicholas Sze commented on HDFS-7184: --- Hi Benoy, let's also merge this to 2.6, where the mover script was first introduced? > Allow data migration tool to run as a daemon > > > Key: HDFS-7184 > URL: https://issues.apache.org/jira/browse/HDFS-7184 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover, scripts >Affects Versions: 3.0.0 >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Fix For: 3.0.0 > > Attachments: HDFS-7184.patch, HDFS-7184.patch > > > Just like balancer, it is sometimes required to run the data migration tool in > a daemon mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7215) Add JvmPauseMonitor to NFS gateway
[ https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177413#comment-14177413 ] Colin Patrick McCabe commented on HDFS-7215: Looks good to me. Are you going to add a way to retrieve the JvmMetrics from the NFS gateway web UI, like {{DataNodeMetrics#getJvmMetrics}}? We could also file a follow-on JIRA to do that if that's more convenient. > Add JvmPauseMonitor to NFS gateway > -- > > Key: HDFS-7215 > URL: https://issues.apache.org/jira/browse/HDFS-7215 > Project: Hadoop HDFS > Issue Type: Improvement > Components: nfs >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li >Priority: Minor > Attachments: HDFS-7215.001.patch > > > Like NN/DN, a GC log would help debug issues in NFS gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7165) Separate block metrics for files with replication count 1
[ https://issues.apache.org/jira/browse/HDFS-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177400#comment-14177400 ] Andrew Wang commented on HDFS-7165: --- Almost there, thanks for revving, Zhe. * In ClientProtocol#getStats, it mentions "total used space of the block pool", and I see that being set in HeartbeatManager, but AFAICT it's dropped in the PB layer on the server side. If it's not being used, let's remove it. If it is being used, it's a compat issue to insert something at an already-being-used index of the stats array. * TestMissingBlocksAlert still has a whitespace-only change. Lines 79-80 were deleted. * TestUnderReplicatedBlockQueues, the extends clause: {code} public class TestUnderReplicatedBlockQueues extends Assert { {code} We should not "extends Assert" in test cases. Instead, let's add static imports on the various Asserts being used. Let's undo the assertInLevel changes too; using {{fail}} as it was before was good. > Separate block metrics for files with replication count 1 > - > > Key: HDFS-7165 > URL: https://issues.apache.org/jira/browse/HDFS-7165 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Andrew Wang >Assignee: Zhe Zhang > Attachments: HDFS-7165-20141003-v1.patch, > HDFS-7165-20141009-v1.patch, HDFS-7165-20141010-v1.patch, > HDFS-7165-20141015-v1.patch > > > We see a lot of escalations because someone has written teragen output with a > replication factor of 1, a DN goes down, and a bunch of missing blocks show > up. These are normally false positives, since teragen output is disposable, > and generally speaking, users should understand this is true for all repl=1 > files. > It'd be nice to be able to separate out these repl=1 missing blocks from > missing blocks with higher replication factors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API
[ https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177388#comment-14177388 ] Haohui Mai commented on HDFS-7207: -- Though I prefer having a native C\+\+ interface, for the first cut I think it is fine to implement it using the C interface and to declare the interface as unstable. On the other hand, however, I think we also need to clean up the interface a little bit to make it more usable for C\+\+ users. > libhdfs3 should not expose exceptions in public C++ API > --- > > Key: HDFS-7207 > URL: https://issues.apache.org/jira/browse/HDFS-7207 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Haohui Mai >Assignee: Colin Patrick McCabe >Priority: Blocker > Attachments: HDFS-7207.001.patch > > > There are three major disadvantages of exposing exceptions in the public API: > * Exposing exceptions in public APIs forces the downstream users to be > compiled with {{-fexceptions}}, which might be infeasible in many use cases. > * It forces other bindings to properly handle all C++ exceptions, which might > be infeasible especially when the binding is generated by tools like SWIG. > * It forces the downstream users to properly handle all C++ exceptions, which > can be cumbersome as in certain cases it will lead to undefined behavior > (e.g., throwing an exception in a destructor is undefined.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7264) The last datanode in a pipeline should send a heartbeat when there is no traffic
[ https://issues.apache.org/jira/browse/HDFS-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177386#comment-14177386 ] Hadoop QA commented on HDFS-7264: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675891/h7264_20141020.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8455//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8455//console This message is automatically generated. 
> The last datanode in a pipeline should send a heartbeat when there is no > traffic > > > Key: HDFS-7264 > URL: https://issues.apache.org/jira/browse/HDFS-7264 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Attachments: h7264_20141017.patch, h7264_20141020.patch > > > When the client is writing slowly, the client will send a heartbeat to signal > that the connection is still alive. This case works fine. > However, when a client is writing fast but some of the datanodes in the > pipeline are busy, a PacketResponder may get a timeout since no ack is sent > from the upstream datanode. We suggest that the last datanode in a pipeline > should send a heartbeat when there is no traffic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
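The proposed behavior can be sketched as a small idle timer on the last DN: when nothing has been acked for some fraction of the ack-timeout window, emit a heartbeat ack so upstream PacketResponders do not time out. The constants and the half-timeout threshold below are illustrative assumptions, not the actual HDFS values:

```java
// Sketch of the proposal: the last datanode in a pipeline sends a
// heartbeat ack when it has had no ack traffic for a while, so upstream
// PacketResponders don't time out. Timing values are illustrative.
class AckHeartbeat {
    static final long TIMEOUT_MS = 60_000;
    private long lastAckTimeMs;

    AckHeartbeat(long nowMs) {
        this.lastAckTimeMs = nowMs;
    }

    /** Call whenever a real ack (or heartbeat) goes upstream. */
    void onAckSent(long nowMs) {
        lastAckTimeMs = nowMs;
    }

    /** Send a heartbeat once half the ack timeout has elapsed idle. */
    boolean shouldSendHeartbeat(long nowMs) {
        return nowMs - lastAckTimeMs >= TIMEOUT_MS / 2;
    }
}
```

This mirrors the existing client-side behavior described above, just applied at the tail of the pipeline instead of at the writer.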
[jira] [Updated] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7266: --- Status: Patch Available (was: Open) > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Andrew Wang > Labels: multi-threading > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is a final, this could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
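The fix under review follows a common Java pattern: a final field assigned in the constructor is safely published, so the "disabled" test can run before taking the lock. A sketch of the pattern with a simplified stand-in for PeerCache (not the actual HDFS class):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the fix: test the immutable "disabled" condition before
// taking the lock, so a disabled cache never contends. SimpleCache is a
// stand-in for PeerCache, not the real HDFS class.
class SimpleCache {
    private final int capacity;              // final: safe to read unlocked
    private final Deque<String> entries = new ArrayDeque<>();

    SimpleCache(int capacity) {
        this.capacity = capacity;
    }

    String get() {
        if (capacity <= 0) {                 // disabled: no lock needed
            return null;
        }
        synchronized (this) {                // lock only when actually caching
            return entries.pollFirst();
        }
    }

    void put(String peer) {
        if (capacity <= 0) {
            return;
        }
        synchronized (this) {
            if (entries.size() < capacity) {
                entries.addLast(peer);
            }
        }
    }
}
```

With the check hoisted, threads opening files against a disabled cache never serialize on the cache monitor, which is the contention the attached profile shows.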
[jira] [Updated] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-7266: --- Priority: Minor (was: Major) Affects Version/s: (was: 2.6.0) 2.7.0 Issue Type: Improvement (was: Bug) > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.0 >Reporter: Gopal V >Assignee: Andrew Wang >Priority: Minor > Labels: multi-threading > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is a final, this could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177385#comment-14177385 ] Colin Patrick McCabe commented on HDFS-7266: Pending jenkins, of course > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.7.0 >Reporter: Gopal V >Assignee: Andrew Wang > Labels: multi-threading > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is a final, this could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object
[ https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177384#comment-14177384 ] Colin Patrick McCabe commented on HDFS-7266: +1. Thanks, Andrew and Gopal. > HDFS Peercache enabled check should not lock on object > -- > > Key: HDFS-7266 > URL: https://issues.apache.org/jira/browse/HDFS-7266 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Andrew Wang > Labels: multi-threading > Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch > > > HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled. > {code} > public synchronized Peer get(DatanodeID dnId, boolean isDomain) { > if (capacity <= 0) { // disabled > return null; > } > {code} > since capacity is a final, this could be moved outside the lock. > !dfs-open-10-threads.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177375#comment-14177375 ] Ming Ma commented on HDFS-7221: --- Thanks, Charles. The latest patch LGTM. > TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7221: --- Attachment: HDFS-7221.005.patch [~mingma], Yes, aesthetically that is better. I've changed that in the .005 version. Thanks for the review. > TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177354#comment-14177354 ] Ming Ma commented on HDFS-7221: --- Thanks, Charles. It shouldn't change the test result either way, but it is better if dfs.namenode.replication.max-streams is set to 16 as well. Otherwise, others might wonder why dfs.namenode.replication.max-streams is set to a much larger value. > TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7184) Allow data migration tool to run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177347#comment-14177347 ] Hudson commented on HDFS-7184: -- FAILURE: Integrated in Hadoop-trunk-Commit #6292 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6292/]) HDFS-7184. Allow data migration tool to run as a daemon. (Benoy Antony) (benoy: rev e4d6a878541cc07fada2bd07dedc4740570a472e) * hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs > Allow data migration tool to run as a daemon > > > Key: HDFS-7184 > URL: https://issues.apache.org/jira/browse/HDFS-7184 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover, scripts >Affects Versions: 3.0.0 >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Fix For: 3.0.0 > > Attachments: HDFS-7184.patch, HDFS-7184.patch > > > Just like balancer, it is sometimes required to run data migration tool in a > daemon mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177344#comment-14177344 ] Benoy Antony commented on HDFS-7204: +1 > balancer doesn't run as a daemon > > > Key: HDFS-7204 > URL: https://issues.apache.org/jira/browse/HDFS-7204 > Project: Hadoop HDFS > Issue Type: Bug > Components: scripts >Affects Versions: 3.0.0 >Reporter: Allen Wittenauer >Assignee: Allen Wittenauer >Priority: Blocker > Labels: newbie > Attachments: HDFS-7204-01.patch, HDFS-7204.patch > > > From HDFS-7184, minor issues with balancer: > * daemon isn't set to true in hdfs to enable daemonization > * start-balancer script has usage instead of hadoop_usage -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7184) Allow data migration tool to run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated HDFS-7184: --- Target Version/s: 3.0.0 Affects Version/s: 3.0.0 > Allow data migration tool to run as a daemon > > > Key: HDFS-7184 > URL: https://issues.apache.org/jira/browse/HDFS-7184 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover, scripts >Affects Versions: 3.0.0 >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Fix For: 3.0.0 > > Attachments: HDFS-7184.patch, HDFS-7184.patch > > > Just like balancer, it is sometimes required to run data migration tool in a > daemon mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7184) Allow data migration tool to run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated HDFS-7184: --- Resolution: Fixed Fix Version/s: 3.0.0 Target Version/s: (was: 2.6.0) Status: Resolved (was: Patch Available) committed to trunk. > Allow data migration tool to run as a daemon > > > Key: HDFS-7184 > URL: https://issues.apache.org/jira/browse/HDFS-7184 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: balancer & mover, scripts >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Fix For: 3.0.0 > > Attachments: HDFS-7184.patch, HDFS-7184.patch > > > Just like balancer, it is sometimes required to run data migration tool in a > daemon mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7218) FSNamesystem ACL operations should write to audit log on failure
[ https://issues.apache.org/jira/browse/HDFS-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177335#comment-14177335 ] Charles Lamb commented on HDFS-7218: The two test failures are unrelated. > FSNamesystem ACL operations should write to audit log on failure > > > Key: HDFS-7218 > URL: https://issues.apache.org/jira/browse/HDFS-7218 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7218.001.patch, HDFS-7218.002.patch, > HDFS-7218.003.patch > > > Various Acl methods in FSNamesystem do not write to the audit log when the > operation is not successful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177332#comment-14177332 ] Charles Lamb commented on HDFS-7221: TestDNFencing is known to fail lately. TestInterDatanodeProtocol runs ok on my local machine with the patch applied. > TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177319#comment-14177319 ] Hadoop QA commented on HDFS-7221: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675869/HDFS-7221.003.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestInterDatanodeProtocol {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8453//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8453//console This message is automatically generated. 
> TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7211) Block invalidation work should be ordered
[ https://issues.apache.org/jira/browse/HDFS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177311#comment-14177311 ] Andrew Wang commented on HDFS-7211: --- Maybe LightWeightLinkedSet? > Block invalidation work should be ordered > - > > Key: HDFS-7211 > URL: https://issues.apache.org/jira/browse/HDFS-7211 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.1 >Reporter: Zhe Zhang >Assignee: Zhe Zhang > > {{InvalidateBlocks#node2blocks}} uses an unordered {{LightWeightHashSet}} to > store blocks (to be invalidated) on the same DN. This causes poor ordering > when a single DN has a large number of blocks to invalidate. Blocks should be > invalidated following the order of invalidation commands -- at least roughly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
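To illustrate why the linked variant Andrew suggests helps, the standard-library analogues behave the same way (LinkedHashSet standing in for LightWeightLinkedSet): only an insertion-ordered set guarantees that draining pending blocks follows the order the invalidation commands arrived. A hedged sketch, not HDFS code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class InvalidateOrderDemo {
    // Drain up to 'batch' entries, the way invalidation work is handed out per heartbeat.
    static List<Long> drain(Set<Long> pending, int batch) {
        List<Long> out = new ArrayList<>();
        Iterator<Long> it = pending.iterator();
        while (it.hasNext() && out.size() < batch) {
            out.add(it.next());
            it.remove();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Long> pending = new LinkedHashSet<>();   // insertion-ordered
        for (long blockId = 1; blockId <= 5; blockId++) {
            pending.add(blockId);
        }
        // The first batch drained is exactly the first blocks enqueued.
        System.out.println(drain(pending, 3));       // [1, 2, 3]
    }
}
```

With a plain HashSet in place of the LinkedHashSet, the drained batch depends on hash order rather than enqueue order, which is the poor ordering the issue describes.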
[jira] [Commented] (HDFS-7225) Failed DataNode lookup can crash NameNode with NullPointerException
[ https://issues.apache.org/jira/browse/HDFS-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177307#comment-14177307 ] Andrew Wang commented on HDFS-7225: --- Nice examination here Zhe. One high-level question though, could we simplify the above by cleaning InvalidateBlocks immediately upon seeing the new datanodeUuid? If the old volume is brought back, the old blocks will be in the block report and the NN will re-populate InvalidateBlocks as needed when it processes the report. > Failed DataNode lookup can crash NameNode with NullPointerException > --- > > Key: HDFS-7225 > URL: https://issues.apache.org/jira/browse/HDFS-7225 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7225-v1.patch > > > {{BlockManager#invalidateWorkForOneNode}} looks up a DataNode by the > {{datanodeUuid}} and passes the resultant {{DatanodeDescriptor}} to > {{InvalidateBlocks#invalidateWork}}. However, if a wrong or outdated > {{datanodeUuid}} is used, a null pointer will be passed to {{invalidateWork}} > which will use it to lookup in a {{TreeMap}}. Since the key type is > {{DatanodeDescriptor}}, key comparison is based on the IP address. A null key > will crash the NameNode with an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
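The crash path described above is easy to reproduce with a plain TreeMap standing in for the map inside InvalidateBlocks (a hypothetical simplification; the real map is keyed by DatanodeDescriptor): under natural ordering, a null key throws NullPointerException, so the result of the DataNode lookup must be null-checked before it is used as a key:

```java
import java.util.TreeMap;

public class NullKeyDemo {
    // Returns true if looking up the given key throws, mimicking the NN crash.
    static boolean lookupThrows(TreeMap<String, Integer> node2blocks, String key) {
        try {
            node2blocks.get(key);
            return false;
        } catch (NullPointerException e) {
            return true;   // TreeMap with natural ordering rejects null keys
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> node2blocks = new TreeMap<>();
        node2blocks.put("10.0.0.1:50010", 42);
        System.out.println(lookupThrows(node2blocks, null));             // true
        System.out.println(lookupThrows(node2blocks, "10.0.0.2:50010")); // false
    }
}
```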
[jira] [Commented] (HDFS-7257) Add the time of last HA state transition to NN's /jmx page
[ https://issues.apache.org/jira/browse/HDFS-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177289#comment-14177289 ] Hadoop QA commented on HDFS-7257: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675855/HDFS-7257.001.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1265 javac compiler warnings (more than the trunk's current 1264 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.balancer.TestBalancer The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHdfsAdmin {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8451//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8451//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8451//console This message is automatically generated. 
> Add the time of last HA state transition to NN's /jmx page > -- > > Key: HDFS-7257 > URL: https://issues.apache.org/jira/browse/HDFS-7257 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7257.001.patch > > > It would be useful to some monitoring apps to expose the last HA transition > time in the NN's /jmx page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7218) FSNamesystem ACL operations should write to audit log on failure
[ https://issues.apache.org/jira/browse/HDFS-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177290#comment-14177290 ] Hadoop QA commented on HDFS-7218: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675861/HDFS-7218.003.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestHdfsAdmin {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8452//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8452//console This message is automatically generated. 
> FSNamesystem ACL operations should write to audit log on failure > > > Key: HDFS-7218 > URL: https://issues.apache.org/jira/browse/HDFS-7218 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7218.001.patch, HDFS-7218.002.patch, > HDFS-7218.003.patch > > > Various Acl methods in FSNamesystem do not write to the audit log when the > operation is not successful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7244) Reduce Namenode memory using Flyweight pattern
[ https://issues.apache.org/jira/browse/HDFS-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177280#comment-14177280 ] Colin Patrick McCabe commented on HDFS-7244: It's exciting to see progress on this, [~langera]! There are a few questions we need to figure out here. One is fallback... if {{ByteBuffer#allocDirect}} is not available on the JVM, what do we do? In my earlier patch, I simply used {{ByteBuffer#alloc}}. I still like this approach, but it does mean we can't chase raw pointers when implementing off-heap data structures. I was trying to address this by using \{ 32-bit slab ID, 32-bit slab offset \} tuples instead. This does require that we do a lookup in a {{map}} whenever we chase a "pointer", though. Another approach to fallback is to use raw pointers if they're available, and \{ slabID, offset\} tuples if they're not. This is faster for the common case of true off-heaping. The complication here is that theoretically one {{allocDirect}} call could fail while another succeeds. If we did this, we'd probably want to create a configuration key like {{hadoop.use.off.heap}}, and throw a hard failure whenever this was {{true}} but {{allocDirect}} failed. What data structures are you planning on using to look up block data in the NN? I was considering an off-heap hash map implementation. If you look at the requirements for our BlocksMap, we need: * fast lookup of \{ 64-bit blockID, string bpId \} to yield all DNs where this block is replicated * ability to iterate over all blocks which a DN holds #1 is not too difficult, but #2 could be tricky. The obvious solution is just to have a hash map from \{ blockID, bpID \} to a node structure which is a member of a few implicit linked lists. This does mean the node structure has variable size, which could be challenging to implement (It's basically the {{malloc}} problem). There isn't any upper limit on the number of DNs a block can be on. 
A better way might be to have a hash map from \{ blockID, bpID, replicaIndex \} so that we avoid implicit linked lists. So to find the first replica for BlockID 123 in bpID "foo", you look up (123, foo, 0)... the second, (123, foo, 1), and so forth. This also raises a few questions. * should we create a lookup table for bpids? We clearly don't want to store the string everywhere, and we can't use Java string interning when doing off-heap. A 16-bit or 32-bit lookup table from string bpid -> bpid index would certainly slim this down. * similar for DNs... how do we identify them? The storage ID is too long to be practical. The simplest way would be a 64-bit ID where we didn't reuse any indices. If we have 32-bit or less DN IDs we'll have to figure out some garbage collection strategy, which could be tricky. Do you think we'll need a branch for this? I don't have a feeling yet for how incremental it is. Clearly adding the Slab code can be done in trunk without destabilizing anything else. I'm not as clear on how difficult the other subtasks are going to be to do in an "incremental" way. Do you have some code using the Slab code yet? It might be hard to know exactly what API we want for Slab until we see how it works in action. Of course we can always modify it later, but posting a combined patch would give me a better feel for it. > Reduce Namenode memory using Flyweight pattern > -- > > Key: HDFS-7244 > URL: https://issues.apache.org/jira/browse/HDFS-7244 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Amir Langer > > Using the flyweight pattern can dramatically reduce memory usage in the > Namenode. The pattern also abstracts the actual storage type and allows the > decision of whether it is off-heap or not and what is the serialisation > mechanism to be configured per deployment. > The idea is to move all BlockInfo data (as a first step) to this storage > using the Flyweight pattern. 
The cost of doing it will be higher latency > when accessing/modifying a block. The idea is that this will be offset with a > reduction in memory and in the case of off-heap, a dramatic reduction in > memory (effectively, memory used for BlockInfo would reduce to a very small > constant value). > This reduction will also have a huge impact on the latency as GC pauses will > be reduced considerably and may even end up with better latency results than > the original code. > I wrote a stand-alone project as a proof of concept, to show the pattern, the > data structure we can use and what will be the performance costs of this > approach. > see [Slab|https://github.com/langera/slab] > and [Slab performance > results|https://github.com/langera/slab/wiki/Performance-Results]. > Slab abstracts the storage, gives several s
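The \{ 32-bit slab ID, 32-bit slab offset \} tuples discussed in the comment above can be packed into a single 64-bit handle, which keeps "pointers" fixed-size whether or not raw off-heap addresses are available. An illustrative sketch only, not code from the Slab project:

```java
public class SlabHandle {
    // Pack a { slabId, offset } tuple into one long "pointer".
    static long encode(int slabId, int offset) {
        return ((long) slabId << 32) | (offset & 0xFFFFFFFFL);
    }

    static int slabId(long handle) {
        return (int) (handle >>> 32);   // high 32 bits
    }

    static int offset(long handle) {
        return (int) handle;            // low 32 bits
    }

    public static void main(String[] args) {
        long handle = encode(7, 123_456);
        System.out.println(slabId(handle) + " " + offset(handle)); // 7 123456
    }
}
```

Chasing such a handle costs one map lookup (slab ID to slab) plus an offset read, which is the indirection trade-off the comment describes for the fallback path.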
[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177262#comment-14177262 ] Siqi Li commented on HDFS-6744: --- I have attached 3 screenshots of each page(livenodes, deadnodes, decomnodes) > Improve decommissioning nodes and dead nodes access on the new NN webUI > --- > > Key: HDFS-6744 > URL: https://issues.apache.org/jira/browse/HDFS-6744 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Siqi Li > Attachments: HDFS-6744.v1.patch, deadnodespage.png, > decomnodespage.png, livendoespage.png > > > The new NN webUI lists live node at the top of the page, followed by dead > node and decommissioning node. From admins point of view: > 1. Decommissioning nodes and dead nodes are more interesting. It is better to > move decommissioning nodes to the top of the page, followed by dead nodes and > decommissioning nodes. > 2. To find decommissioning nodes or dead nodes, the whole page that includes > all nodes needs to be loaded. That could take some time for big clusters. > The legacy web UI filters out the type of nodes dynamically. That seems to > work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177261#comment-14177261 ] Hadoop QA commented on HDFS-6744: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675894/decomnodespage.png against trunk revision d5084b9. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8456//console This message is automatically generated. > Improve decommissioning nodes and dead nodes access on the new NN webUI > --- > > Key: HDFS-6744 > URL: https://issues.apache.org/jira/browse/HDFS-6744 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Siqi Li > Attachments: HDFS-6744.v1.patch, deadnodespage.png, > decomnodespage.png, livendoespage.png > > > The new NN webUI lists live node at the top of the page, followed by dead > node and decommissioning node. From admins point of view: > 1. Decommissioning nodes and dead nodes are more interesting. It is better to > move decommissioning nodes to the top of the page, followed by dead nodes and > decommissioning nodes. > 2. To find decommissioning nodes or dead nodes, the whole page that includes > all nodes needs to be loaded. That could take some time for big clusters. > The legacy web UI filters out the type of nodes dynamically. That seems to > work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated HDFS-6744: -- Attachment: decomnodespage.png deadnodespage.png livendoespage.png > Improve decommissioning nodes and dead nodes access on the new NN webUI > --- > > Key: HDFS-6744 > URL: https://issues.apache.org/jira/browse/HDFS-6744 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Siqi Li > Attachments: HDFS-6744.v1.patch, deadnodespage.png, > decomnodespage.png, livendoespage.png > > > The new NN webUI lists live node at the top of the page, followed by dead > node and decommissioning node. From admins point of view: > 1. Decommissioning nodes and dead nodes are more interesting. It is better to > move decommissioning nodes to the top of the page, followed by dead nodes and > decommissioning nodes. > 2. To find decommissioning nodes or dead nodes, the whole page that includes > all nodes needs to be loaded. That could take some time for big clusters. > The legacy web UI filters out the type of nodes dynamically. That seems to > work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7264) The last datanode in a pipeline should send a heartbeat when there is no traffic
[ https://issues.apache.org/jira/browse/HDFS-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-7264: -- Attachment: h7264_20141020.patch h7264_20141020.patch: fixes the typos. > The last datanode in a pipeline should send a heartbeat when there is no > traffic > > > Key: HDFS-7264 > URL: https://issues.apache.org/jira/browse/HDFS-7264 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Attachments: h7264_20141017.patch, h7264_20141020.patch > > > When the client is writing slowly, the client will send a heartbeat to signal > that the connection is still alive. This case works fine. > However, when a client is writing fast but some of the datanodes in the > pipeline are busy, a PacketResponder may get a timeout since no ack is sent > from the upstream datanode. We suggest that the last datanode in a pipeline > should send a heartbeat when there is no traffic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7211) Block invalidation work should be ordered
[ https://issues.apache.org/jira/browse/HDFS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang updated HDFS-7211: -- Component/s: namenode Target Version/s: 2.7.0 Affects Version/s: 2.5.1 > Block invalidation work should be ordered > - > > Key: HDFS-7211 > URL: https://issues.apache.org/jira/browse/HDFS-7211 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.5.1 >Reporter: Zhe Zhang >Assignee: Zhe Zhang > > {{InvalidateBlocks#node2blocks}} uses an unordered {{LightWeightHashSet}} to > store blocks (to be invalidated) on the same DN. This causes poor ordering > when a single DN has a large number of blocks to invalidate. Blocks should be > invalidated following the order of invalidation commands -- at least roughly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7264) The last datanode in a pipeline should send a heartbeat when there is no traffic
[ https://issues.apache.org/jira/browse/HDFS-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177248#comment-14177248 ] Tsz Wo Nicholas Sze commented on HDFS-7264: --- Hi Vinay, thanks for reviewing the patch. > Why can't heartbeat be enabled always.. without configuration flag, which is > disabled by default. ? It is for rolling upgrade. We have to disable the feature first, upgrade, and then enable the feature. Otherwise, the old software cannot handle the new heartbeat. > The last datanode in a pipeline should send a heartbeat when there is no > traffic > > > Key: HDFS-7264 > URL: https://issues.apache.org/jira/browse/HDFS-7264 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Attachments: h7264_20141017.patch > > > When the client is writing slowly, the client will send a heartbeat to signal > that the connection is still alive. This case works fine. > However, when a client is writing fast but some of the datanodes in the > pipeline are busy, a PacketResponder may get a timeout since no ack is sent > from the upstream datanode. We suggest that the last datanode in a pipeline > should send a heartbeat when there is no traffic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177240#comment-14177240 ] Haohui Mai commented on HDFS-6744: -- [~l201514], can you please post a screenshot? Thanks. > Improve decommissioning nodes and dead nodes access on the new NN webUI > --- > > Key: HDFS-6744 > URL: https://issues.apache.org/jira/browse/HDFS-6744 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Siqi Li > Attachments: HDFS-6744.v1.patch > > > The new NN webUI lists live node at the top of the page, followed by dead > node and decommissioning node. From admins point of view: > 1. Decommissioning nodes and dead nodes are more interesting. It is better to > move decommissioning nodes to the top of the page, followed by dead nodes and > decommissioning nodes. > 2. To find decommissioning nodes or dead nodes, the whole page that includes > all nodes needs to be loaded. That could take some time for big clusters. > The legacy web UI filters out the type of nodes dynamically. That seems to > work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7221: --- Attachment: HDFS-7221.004.patch [~mingma], Thanks for the review. That seems like a good idea. The .004 patch moves the setting to HAStressTestHarness. We can see if the jenkins run blows anything up. > TestDNFencingWithReplication fails consistently > --- > > Key: HDFS-7221 > URL: https://issues.apache.org/jira/browse/HDFS-7221 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.6.0 >Reporter: Charles Lamb >Assignee: Charles Lamb >Priority: Minor > Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, > HDFS-7221.003.patch, HDFS-7221.004.patch > > > TestDNFencingWithReplication consistently fails with a timeout, both in > jenkins runs and on my local machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message
[ https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177215#comment-14177215 ] Andrew Wang commented on HDFS-3342: --- Hi Yongjun, thanks for working on this, Looking at the new output you posted, it looks like it quashes the ERROR log, but we still end up with 3 log prints for the same issue, and one is still at WARN. Wouldn't an ideal solution print just a single log message at INFO? Also note that if someone has the log level set to WARN (happens in production deployments), they'll see the scary stack trace but not the new log print you added. It'd also be nice to not have stack trace spam in this situation, since it's somewhat expected. LMK what you think, thanks again. > SocketTimeoutException in BlockSender.sendChunks could have a better error > message > -- > > Key: HDFS-3342 > URL: https://issues.apache.org/jira/browse/HDFS-3342 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.0.0-alpha >Reporter: Todd Lipcon >Assignee: Yongjun Zhang >Priority: Minor > Labels: supportability > Attachments: HDFS-3342.001.patch > > > Currently, if a client connects to a DN and begins to read a block, but then > stops calling read() for a long period of time, the DN will log a > SocketTimeoutException "48 millis timeout while waiting for channel to be > ready for write." This is because there is no "keepalive" functionality of > any kind. At a minimum, we should improve this error message to be an INFO > level log which just says that the client likely stopped reading, so > disconnecting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7265) Use a throttler for replica write in datanode
[ https://issues.apache.org/jira/browse/HDFS-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177164#comment-14177164 ] Haohui Mai commented on HDFS-7265: -- I found that it is better to throttle dynamically instead of throttling on a pre-defined bandwidth. Other workloads in the clusters can dramatically impact the disk utilization, thus it is quite difficult to come up with a configuration that protects the DNs from being overloaded while still saturating the peak throughput. > Use a throttler for replica write in datanode > - > > Key: HDFS-7265 > URL: https://issues.apache.org/jira/browse/HDFS-7265 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Attachments: h7265_20141018.patch > > > BlockReceiver processes packets in BlockReceiver.receivePacket() as follows > # read from socket > # enqueue the ack > # write to downstream > # write to disk > The above steps are repeated for each packet in a single thread. When there > are a lot of concurrent writes in a datanode, the write time in #4 becomes > very long. As a result, it leads to SocketTimeoutException since it cannot > read from the socket for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
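For contrast with the dynamic approach argued for above, a fixed-bandwidth throttler in the spirit of Hadoop's DataTransferThrottler (a simplified sketch, not its actual code) looks like this; the difficulty Haohui points out is choosing bytesPerSec when other workloads share the disks:

```java
public class SimpleThrottler {
    private final long bytesPerPeriod;   // byte budget per accounting period
    private final long periodMs;
    private long periodStart;
    private long bytesThisPeriod;

    public SimpleThrottler(long bytesPerSec, long periodMs) {
        this.bytesPerPeriod = bytesPerSec * periodMs / 1000;
        this.periodMs = periodMs;
        this.periodStart = System.currentTimeMillis();
    }

    // Called by the writer after each packet; sleeps once the budget is spent.
    public synchronized void throttle(long numBytes) throws InterruptedException {
        bytesThisPeriod += numBytes;
        long now = System.currentTimeMillis();
        if (now - periodStart >= periodMs) {
            // New accounting period: reset the budget.
            periodStart = now;
            bytesThisPeriod = numBytes;
        } else if (bytesThisPeriod > bytesPerPeriod) {
            // Budget exhausted: wait out the remainder of the period.
            Thread.sleep(periodMs - (now - periodStart));
            periodStart = System.currentTimeMillis();
            bytesThisPeriod = 0;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleThrottler throttler = new SimpleThrottler(1_000_000L, 100);
        throttler.throttle(1024);   // well under budget: returns immediately
    }
}
```

A dynamic variant would replace the fixed bytesPerPeriod with a value derived from observed disk or ack latency, sidestepping the static-configuration problem.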