[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278619#comment-13278619 ]

Colin Patrick McCabe commented on HDFS-2982:

bq. The javadoc for JournalSet#selectInputStreams is a little over-simplified =) - how about describing the algorithm (get the streams starting with fromTxid from all managers, return a list sorted by the starting txid etc)

Ok, will add.

bq. In EditLogFileInputStream#init why only close the stream that threw?

Yeah, I guess closing an already closed stream should be idempotent, at least if they're correctly implementing the Closeable interface.

bq. In TestEditLog readAllEdits is dead code

ok

bq. How about describing the high-level approach in the patch?

From the high level, this patch is about getting rid of two APIs in JournalManager (getNumberOfTransactions and getInputStream) and adding one API to JournalManager (selectInputStreams). The new API simply gathers up all the available streams in one go and puts them into a Collection. This is more efficient, and also better for some of the changes we'd like to make in the future, like supporting overlapping edit log streams.

Edit log validation is the process of finding out how far in-progress edit logs go. We do it during edit log finalization so that we can find out what to rename the in-progress edit log file to. ("validation" might not be a great name for this process, but it's probably too late to change it now.) We don't validate finalized logs.

There are some minor changes to validation here, and a major change. First, the minor changes. One change is to have the validation class contain only the end txid, rather than the start txid, number of txids, and end txid. The start txid is already known, and the number of txids is not an independent count, as you might think, but merely end - start + 1. So it's good to get rid of that cruft.

Another minor change is that EditLogValidation#corruptionDetected was renamed to EditLogValidation#hasCorruptHeader. That is the concept it always represented: it never referred to anything other than header corruption, and the rest of the code even uses the terminology hasCorruptHeader to represent this info (see EditLogFile#hasCorruptHeader). So I'm just trying to be consistent.

The major change is that we now read to the end of a corrupt file in validation, finding the true end transaction rather than merely the first unreadable txid. This is needed for recovery to work properly on these files. It's possible that this change could be dropped from this patch. Conceptually, it's more related to HDFS-3049.

> Startup performance suffers when there are many edit log segments
> -----------------------------------------------------------------
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 2.0.0
> Reporter: Todd Lipcon
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Attachments: HDFS-2982.001.patch
>
> For every one of the edit log segments, it seems like we are calling
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is
> killing performance, especially when there are many log segments and the
> directory is stored on NFS. It is taking several minutes to start up the NN
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
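
Editorial sketch for the HDFS-2982 comment above: the selectInputStreams idea (ask every journal manager for its streams in one pass, keep those relevant to fromTxId, return them sorted by starting txid) might look roughly like the following. This is not the code from the patch. The class names SelectStreamsSketch, EditStream and Journal are hypothetical stand-ins, and the rule "keep a stream if its last txid is at or after fromTxId" is an assumption about what "starting with fromTxid" means.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the real HDFS classes.
final class SelectStreamsSketch {
    /** One edit log segment covering the txid range [firstTxId, lastTxId]. */
    static final class EditStream {
        final String name;
        final long firstTxId;
        final long lastTxId;

        EditStream(String name, long firstTxId, long lastTxId) {
            this.name = name;
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }
    }

    /** A journal that can report all of its segments in a single call. */
    interface Journal {
        List<EditStream> listStreams();
    }

    /**
     * Gathers streams from every journal in one pass and returns them
     * sorted by starting txid. Assumption: a stream is relevant when it
     * contains fromTxId or starts after it.
     */
    static List<EditStream> selectInputStreams(List<Journal> journals, long fromTxId) {
        List<EditStream> selected = new ArrayList<>();
        for (Journal journal : journals) {
            // One listing per journal, instead of one listing per segment.
            for (EditStream stream : journal.listStreams()) {
                if (stream.lastTxId >= fromTxId) {
                    selected.add(stream);
                }
            }
        }
        selected.sort(Comparator.comparingLong((EditStream s) -> s.firstTxId));
        return selected;
    }
}
```

The point of the single pass is that each journal is listed once, rather than once per segment as in the findMaxTransaction behavior described in the issue.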
[jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby
[ https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278617#comment-13278617 ]

Rakesh R commented on HDFS-3441:

Yeah, I have seen a race condition between purgeLogsOlderThan() on the Standby and finalizeLogSegment() on the Active.

Cause: the sequence of operations is as follows:
1) When the Standby comes to purge, it reads the full list of ledger logs, including the inprogress_72 file.
2) Meanwhile, the Active NN finalizes the log segment inprogress_72 and creates a new inprogress_74.
3) The Standby then reads the data of inprogress_72 to decide whether it is in progress or not, and throws NoNodeException.

I feel the filtering of in-progress files could be done based on the file name itself, instead of reading the content and filtering based on the data, as follows:

BookKeeperJournalManager.java
{noformat}
try {
  List<String> ledgerNames = zkc.getChildren(ledgerPath, false);
  for (String ledgerName : ledgerNames) {
    // Skip in-progress ledgers by name, without reading their znode data.
    if (!inProgressOk && ledgerName.contains("inprogress")) {
      continue;
    }
    ledgers.add(EditLogLedgerMetadata.read(zkc, ledgerPath + "/" + ledgerName));
  }
} catch (Exception e) {
  throw new IOException("Exception reading ledger list from zk", e);
}
{noformat}

> Race condition between rolling logs at active NN and purging at standby
> -----------------------------------------------------------------------
>
> Key: HDFS-3441
> URL: https://issues.apache.org/jira/browse/HDFS-3441
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: suja s
>
> Standby NN has got the ledgerlist with list of all files, including the
> inprogress file (with say inprogress_val1)
> Active NN has done finalization and created new inprogress file.
> Standby when proceeds further finds that the inprogress file which it had in
> the list is not present and NN gets shutdown
> NN Logs
> =======
> 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 201 saved in 0 seconds.
> 2012-05-17 22:15:03,874 INFO > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll > on remote NameNode /xx.xx.xx.102:8020 > 2012-05-17 22:15:03,923 INFO > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to > retain 2 images with txid >= 111 > 2012-05-17 22:15:03,923 INFO > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old > image > FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109, > cpktTxId=109) > 2012-05-17 22:15:03,961 FATAL > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 > failed for required journal > (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767, > stream=null)) > java.io.IOException: Exception reading ledger list from zk > at > org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531) > at > org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011) > at > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157) > at > 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$Checkpoint
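
Editorial sketch for the HDFS-3441 comment above: the race and the proposed name-based filter can be reproduced without ZooKeeper. In this minimal simulation, a plain Map stands in for the znode store and a List for the earlier getChildren() snapshot. LedgerListSketch, readLedgers and the ledger names used with it are hypothetical; only the "inprogress" name check mirrors the suggestion in the comment.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical simulation; a Map plays the role of the ZooKeeper znode store.
final class LedgerListSketch {
    /**
     * Reads metadata for each name in a previously taken listing.
     * In-progress entries are skipped by name alone, so a ledger that was
     * finalized (and so removed) after the listing is never looked up.
     */
    static List<String> readLedgers(List<String> listedNames,
                                    Map<String, String> store,
                                    boolean inProgressOk) throws IOException {
        List<String> ledgers = new ArrayList<>();
        for (String name : listedNames) {
            if (!inProgressOk && name.contains("inprogress")) {
                continue; // decided from the name, no read needed
            }
            String data = store.get(name);
            if (data == null) {
                // Stand-in for the NoNodeException seen in the stack trace.
                throw new IOException("NoNode for " + name);
            }
            ledgers.add(data);
        }
        return ledgers;
    }
}
```

With a stale listing that still contains inprogress_72, a call with inProgressOk set to false succeeds, while a call that tries to read every listed name fails in the same way as the trace above.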
[jira] [Updated] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby
[ https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] suja s updated HDFS-3441: - Description: Standby NN has got the ledgerlist with list of all files, including the inprogress file (with say inprogress_val1) Active NN has done finalization and created new inprogress file. Standby when proceeds further finds that the inprogress file which it had in the list is not present and NN gets shutdown NN Logs = 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 201 saved in 0 seconds. 2012-05-17 22:15:03,874 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode /xx.xx.xx.102:8020 2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 111 2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109, cpktTxId=109) 2012-05-17 22:15:03,961 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767, stream=null)) java.io.IOException: Exception reading ledger list from zk at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531) at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444) at org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541) at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322) at org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011) at 
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98) at org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216) Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142) at org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113) at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528) ... 
16 more
2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

ZK Data

[zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
-40;59;116
cZxid = 0x2be
ctime = Thu May 17 22:15:03 IST 2012
mZxid = 0x2be
mtime = Thu May 17 22:15:03 IST 2012
pZxid = 0x2be
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 10
numChildren = 0

was:
Standby NN has got the ledgerlist with list of all files, including the inprogress file (with say inprogress_val1)
Active NN has done finalization and created new inprogress file.
Standby when proceeds further finds that the inprogress file which it had in the list is not present and NN gets shutdown

NN Logs
=======
2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 201 saved in 0 seconds.
2012-05-17 22:15:03,874 INFO org.apache.hadoop.hdfs.
[jira] [Commented] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.
[ https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278613#comment-13278613 ]

Vinay commented on HDFS-3436:

Thanks Nicholas, that works. I will upload a patch for that.

> Append to file is failing when one of the datanode where the block present is down.
> -----------------------------------------------------------------------------------
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 2.0.0
> Reporter: Brahma Reddy Battula
> Assignee: Vinay
>
> Scenario:
> =========
> 1. Cluster with 4 DataNodes.
> 2. Wrote a file to 3 DNs, DN1->DN2->DN3
> 3. Stopped DN3.
> Now append to the file fails because addDatanode2ExistingPipeline fails.
> *Client Trace*
> {noformat}
> 2012-04-24 22:06:09,947 INFO hdfs.DFSClient (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN hdfs.DFSClient (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:run(549)) - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > 2012-04-24 22:06:10,072 WARN hdfs.DFSClient > (DFSOutputStream.java:hflush(1515)) - Error while syncing > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > {noformat} > *DataNode Trace* > {noformat} > 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - > host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation > src: /127.0.0.1:49811 dest: /127.0.0.1:49744 > java.io.IOException: > BP-2001850558-xx.xx.xx.xx-1337249347060:blk_-8165642083860293107_1002 is > neither a RBW nor a Finalized, r=ReplicaBeingWritten, > blk_-8165642083860293107_1003, 
RBW > getNumBytes() = 1024 > getBytesOnDisk() = 1024 > getVisibleLength()= 1024 > getVolume() = > E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current > getBlockFile()= > E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-xx.xx.xx.xx-1337249347060\current\rbw\blk_-8165642083860293107 > bytesAcked=1024 > bytesOnDisk=102 > at > org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525) > at > org.apache.hadoop.hdfs.protocol.da
[jira] [Work started] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.
[ https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HDFS-3436 started by Vinay.

> Append to file is failing when one of the datanode where the block present is down.
> -----------------------------------------------------------------------------------
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 2.0.0
> Reporter: Brahma Reddy Battula
> Assignee: Vinay
>
> Scenario:
> =========
> 1. Cluster with 4 DataNodes.
> 2. Wrote a file to 3 DNs, DN1->DN2->DN3
> 3. Stopped DN3.
> Now append to the file fails because addDatanode2ExistingPipeline fails.
> *Client Trace*
> {noformat}
> 2012-04-24 22:06:09,947 INFO hdfs.DFSClient (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN hdfs.DFSClient (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:run(549)) - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > 2012-04-24 22:06:10,072 WARN hdfs.DFSClient > (DFSOutputStream.java:hflush(1515)) - Error while syncing > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > {noformat} > *DataNode Trace* > {noformat} > 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - > host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation > src: /127.0.0.1:49811 dest: /127.0.0.1:49744 > java.io.IOException: > BP-2001850558-xx.xx.xx.xx-1337249347060:blk_-8165642083860293107_1002 is > neither a RBW nor a Finalized, r=ReplicaBeingWritten, > blk_-8165642083860293107_1003, RBW > getNumBytes() = 1024 > getBytesOnDisk() = 1024 > getVisibleLength()= 1024 > getVolume() = > 
E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current > getBlockFile()= > E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-xx.xx.xx.xx-1337249347060\current\rbw\blk_-8165642083860293107 > bytesAcked=1024 > bytesOnDisk=102 > at > org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:114) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:78
[jira] [Created] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby
suja s created HDFS-3441: Summary: Race condition between rolling logs at active NN and purging at standby Key: HDFS-3441 URL: https://issues.apache.org/jira/browse/HDFS-3441 Project: Hadoop HDFS Issue Type: Sub-task Reporter: suja s Standby NN has got the ledgerlist with list of all files, including the inprogress file (with say inprogress_val1) Active NN has done finalization and created new inprogress file. Standby when proceeds further finds that the inprogress file which it had in the list is not present and NN gets shutdown NN Logs = 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 201 saved in 0 seconds. 2012-05-17 22:15:03,874 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode /10.18.40.102:8020 2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 111 2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109, cpktTxId=109) 2012-05-17 22:15:03,961 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767, stream=null)) java.io.IOException: Exception reading ledger list from zk at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531) at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444) at org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541) at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322) at org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538) at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011) at org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98) at org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216) Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142) at org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113) at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528) ... 
16 more
2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

ZK Data

[zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
-40;59;116
cZxid = 0x2be
ctime = Thu May 17 22:15:03 IST 2012
mZxid = 0x2be
mtime = Thu May 17 22:15:03 IST 2012
pZxid = 0x2be
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 10
numChildren = 0
[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278610#comment-13278610 ]

Colin Patrick McCabe commented on HDFS-2982:

Lots and lots of unit tests would have to change if EditLogInputStream started requiring an init() call. Not to mention the subtle bugs that might crop up. That alone would almost be worth its own patch. Let's deal with this later if we decide it's something worth doing.

Frankly, I would argue against it because I think there are better APIs we could design. In particular, an API which separates the concept of a stream from the concept of a stream location is much more efficient and results in cleaner code, because the invariant that you can't use something without initializing it is then enforced by the type system.

So basically, can we revisit this idea later, as in after this week?

bq. The new test case is missing the @Test annotation so it won't run.

Will fix.

bq. Are the changes to validateEditLog necessary here? And the change to how corrupt files are handled?

It's often really time consuming to change these things because then I have to redo all the unit tests. Still, I will take a look at it.

> Startup performance suffers when there are many edit log segments
> -----------------------------------------------------------------
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 2.0.0
> Reporter: Todd Lipcon
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Attachments: HDFS-2982.001.patch
>
> For every one of the edit log segments, it seems like we are calling
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is
> killing performance, especially when there are many log segments and the
> directory is stored on NFS. It is taking several minutes to start up the NN
> when there are several thousand log segments present.
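
Editorial sketch for the HDFS-2982 comment above: one way to read "separate the concept of a stream from the concept of a stream location" is that the only way to obtain a readable stream is to open a location, so an uninitialized stream cannot exist. The following is a minimal illustration under that reading, not a design proposed in the thread. StreamLocationSketch, StreamLocation and OpenStream are hypothetical names, and a byte array stands in for a file or ledger.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical types for illustration only; not HDFS classes.
final class StreamLocationSketch {
    /** Says where a segment lives. Cheap to create, holds no open resources. */
    static final class StreamLocation {
        final long firstTxId;
        private final byte[] bytes; // stand-in for a file path or ledger id

        StreamLocation(long firstTxId, byte[] bytes) {
            this.firstTxId = firstTxId;
            this.bytes = bytes.clone();
        }

        /** The only way to get an OpenStream, so every OpenStream is initialized. */
        OpenStream open() throws IOException {
            return new OpenStream(new ByteArrayInputStream(bytes));
        }
    }

    /** A stream that is ready to read from the moment it exists. */
    static final class OpenStream implements Closeable {
        private final InputStream in;

        private OpenStream(InputStream in) {
            this.in = in;
        }

        /** Returns the next byte, or -1 at end of stream. */
        int readByte() throws IOException {
            return in.read();
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }
}
```

Because OpenStream has a private constructor, code outside the sketch cannot construct one without going through StreamLocation#open(), which is the "enforced by the type system" property the comment refers to. Sorting or filtering many StreamLocation objects also costs nothing in open file handles.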
[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278609#comment-13278609 ]

Hudson commented on HDFS-3440:

Integrated in Hadoop-Mapreduce-trunk-Commit #2286 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2286/])
HDFS-3440. More effectively limit stream memory consumption when reading corrupt edit logs. Contributed by Colin Patrick McCabe. (Revision 1339978)

Result = FAILURE
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339978
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogInputStream.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogBackupInputStream.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/StreamLimiter.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSEditLogLoader.java

> should more effectively limit stream memory consumption when reading corrupt edit logs
> --------------------------------------------------------------------------------------
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Fix For: 2.0.1
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log.
> However, this could result in us using all of those 100 MB when reading bogus
> data, which is not what we want. It also could easily make some corrupt edit
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException
> when we're in this situation, and does not result in huge memory consumption.
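
Editorial sketch for the HDFS-3440 description above: a stream limiter in the sense described (a cap on how many bytes one read operation may consume, with a clean IOException once the cap is hit) can be illustrated with a small wrapper around InputStream. This is not the StreamLimiter from the committed patch. The class name LimitedInputStream and the method names setLimit and clearLimit are assumptions made for this example, as is the choice to reject an oversized read before touching the underlying stream.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper for illustration; not the class added by HDFS-3440.
final class LimitedInputStream extends FilterInputStream {
    private long remaining = Long.MAX_VALUE; // effectively unlimited

    LimitedInputStream(InputStream in) {
        super(in);
    }

    /** Allows at most limit further bytes to be consumed. */
    void setLimit(long limit) {
        remaining = limit;
    }

    /** Removes the cap again, e.g. once one opcode was read successfully. */
    void clearLimit() {
        remaining = Long.MAX_VALUE;
    }

    @Override
    public int read() throws IOException {
        checkLimit(1);
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        checkLimit(len);
        int n = super.read(buf, off, len);
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        checkLimit(n);
        long skipped = super.skip(n);
        if (skipped > 0) {
            remaining -= skipped;
        }
        return skipped;
    }

    /** Fails fast, before any bytes are consumed, if the request exceeds the cap. */
    private void checkLimit(long wanted) throws IOException {
        if (wanted > remaining) {
            throw new IOException("Read of " + wanted
                + " byte(s) exceeds the remaining limit of " + remaining);
        }
    }
}
```

A reader would call setLimit before decoding each opcode and clearLimit afterwards, so a bogus length field in corrupt data produces an exception instead of a very large buffer.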
[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278591#comment-13278591 ] Todd Lipcon commented on HDFS-2982: --- Hi Colin. Many of my comments from HDFS-3049 still apply (eg about the lazy initialization of the reader stream) Are the changes to validateEditLog necessary here? And the change to how corrupt files are handled? It seems like they fit more appropriately into HDFS-3049. I think you should be able to separate those out from this performance fix. The new test case is missing the @Test annotation so it won't run. > Startup performance suffers when there are many edit log segments > - > > Key: HDFS-2982 > URL: https://issues.apache.org/jira/browse/HDFS-2982 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Colin Patrick McCabe >Priority: Critical > Attachments: HDFS-2982.001.patch > > > For every one of the edit log segments, it seems like we are calling > listFiles on the edit log directory inside of {{findMaxTransaction}}. This is > killing performance, especially when there are many log segments and the > directory is stored on NFS. It is taking several minutes to start up the NN > when there are several thousand log segments present. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
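The performance problem in the HDFS-2982 description (one directory listing per segment) can be contrasted with a single-listing scan. The sketch below is an assumption-laden illustration, not NameNode code: the class EditLogSegmentScan is hypothetical, and it assumes finalized segments are named edits_<startTxId>-<endTxId>.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: find the highest txid with one directory listing. */
class EditLogSegmentScan {
  // Assumed naming for finalized segments: edits_<startTxId>-<endTxId>
  private static final Pattern FINALIZED = Pattern.compile("edits_(\\d+)-(\\d+)");

  /** Returns the highest end txid among the given names, or -1 if none match. */
  static long findMaxTxId(Collection<String> fileNames) {
    long max = -1;
    for (String name : fileNames) {
      Matcher m = FINALIZED.matcher(name);
      if (m.matches()) {
        max = Math.max(max, Long.parseLong(m.group(2)));
      }
    }
    return max;
  }

  /** Lists the directory exactly once, then works on the in-memory names. */
  static long findMaxTxId(File dir) throws IOException {
    String[] names = dir.list();
    if (names == null) {
      throw new IOException("Cannot list " + dir);
    }
    return findMaxTxId(Arrays.asList(names));
  }
}
```

With several thousand segments on NFS, the difference is one remote listing instead of one per segment.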
[jira] [Updated] (HDFS-744) Support hsync in HDFS
[ https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HDFS-744:
---
Attachment: HDFS-744-trunk-v5.patch

New version of the patch.
* Implemented all of Nicholas's suggestions.
* Added some simple tests.
* Added a flushFS() method to SequenceFile.Writer.

I would still prefer to implement hsync() as flushOrSync(syncBlock) rather than flushOrSync(true), but this works too.

> Support hsync in HDFS
> -
>
> Key: HDFS-744
> URL: https://issues.apache.org/jira/browse/HDFS-744
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node, hdfs client
> Reporter: Hairong Kuang
> Assignee: Lars Hofhansl
> Attachments: HDFS-744-trunk-v2.patch, HDFS-744-trunk-v3.patch,
> HDFS-744-trunk-v4.patch, HDFS-744-trunk-v5.patch, HDFS-744-trunk.patch,
> hdfs-744-v2.txt, hdfs-744-v3.txt, hdfs-744.txt
>
> HDFS-731 implements hsync by default as hflush. As described in HADOOP-6313,
> the real expected semantics should be "flushes out to all replicas and all
> replicas have done posix fsync equivalent - ie the OS has flushed it to the
> disk device (but the disk may have it in its cache)." This jira aims to
> implement the expected behaviour.
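The flush-versus-sync distinction in the HDFS-744 description has a local-file analogue that may help. The sketch below uses only java.io, not any HDFS API, and the class name FlushVsSyncDemo is hypothetical: flush() hands the bytes to the operating system, while FileDescriptor.sync() additionally asks the OS to push them to the storage device, which is the "posix fsync equivalent" the description refers to.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical local-file illustration of flush vs. fsync semantics. */
class FlushVsSyncDemo {
  static void write(File f, String data, boolean sync) throws IOException {
    try (FileOutputStream out = new FileOutputStream(f)) {
      out.write(data.getBytes(StandardCharsets.UTF_8));
      // flush(): data leaves this process but may still sit in the OS page cache.
      out.flush();
      if (sync) {
        // sync(): blocks until the OS reports the data written to the device
        // (the device itself may still hold it in its own cache).
        out.getFD().sync();
      }
    }
  }
}
```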
[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-3440:
--
Resolution: Fixed
Fix Version/s: 2.0.1
Target Version/s: 2.0.1
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

> should more effectively limit stream memory consumption when reading corrupt
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Fix For: 2.0.1
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log.
> However, this could result in us using all of those 100 MB when reading bogus
> data, which is not what we want. It also could easily make some corrupt edit
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException
> when we're in this situation and does not result in huge memory consumption.
[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278582#comment-13278582 ] Eli Collins commented on HDFS-2982: --- Hey Colin, Took a quick look. How about describing the high-level approach in the patch? - The javadoc for JournalSet#selectInputStreams is a little over-simplified =) - how about describing the algorithm (get the streams starting with fromTxid from all managers, return a list sorted by the starting txid etc) - In EditLogFileInputStream#init why only close the stream that threw? - In TestEditLog readAllEdits is dead code > Startup performance suffers when there are many edit log segments > - > > Key: HDFS-2982 > URL: https://issues.apache.org/jira/browse/HDFS-2982 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Colin Patrick McCabe >Priority: Critical > Attachments: HDFS-2982.001.patch > > > For every one of the edit log segments, it seems like we are calling > listFiles on the edit log directory inside of {{findMaxTransaction}}. This is > killing performance, especially when there are many log segments and the > directory is stored on NFS. It is taking several minutes to start up the NN > when there are several thousand log segments present. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
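The algorithm Eli asks to have documented for JournalSet#selectInputStreams (gather streams from all managers starting at fromTxid, return them sorted by starting txid) might look like the sketch below. The types Segment and Journal are hypothetical stand-ins, not the real JournalManager or EditLogInputStream classes, and "starting with fromTxid" is interpreted here as "contains at least one txid >= fromTxId".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of gathering and ordering edit log streams. */
class SelectStreamsSketch {
  /** Stand-in for an edit log stream: only its txid range matters here. */
  static final class Segment {
    final long firstTxId;
    final long lastTxId;

    Segment(long firstTxId, long lastTxId) {
      this.firstTxId = firstTxId;
      this.lastTxId = lastTxId;
    }
  }

  /** Stand-in for a journal manager that can enumerate its segments. */
  interface Journal {
    List<Segment> segments();
  }

  static List<Segment> selectInputStreams(List<Journal> journals, long fromTxId) {
    List<Segment> result = new ArrayList<>();
    // Gather from every journal in one pass; duplicates across journals are kept.
    for (Journal journal : journals) {
      for (Segment s : journal.segments()) {
        if (s.lastTxId >= fromTxId) {
          result.add(s);
        }
      }
    }
    // Order by starting txid so the caller can apply edits sequentially.
    result.sort(Comparator.comparingLong((Segment s) -> s.firstTxId));
    return result;
  }
}
```

Keeping overlapping segments from different journals in the result, rather than deduplicating them here, leaves the choice of which copy to read to the caller.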
[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278581#comment-13278581 ]

Todd Lipcon commented on HDFS-3440:
---
+1, looks good to me. Will commit this momentarily.

> should more effectively limit stream memory consumption when reading corrupt
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log.
> However, this could result in us using all of those 100 MB when reading bogus
> data, which is not what we want. It also could easily make some corrupt edit
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException
> when we're in this situation and does not result in huge memory consumption.
[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278578#comment-13278578 ]

Hadoop QA commented on HDFS-3440:
-
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12527995/HDFS-3440.002.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 javadoc. The javadoc tool appears to have generated 2 warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2469//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2469//console

This message is automatically generated.

> should more effectively limit stream memory consumption when reading corrupt
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log.
> However, this could result in us using all of those 100 MB when reading bogus
> data, which is not what we want. It also could easily make some corrupt edit
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException
> when we're in this situation and does not result in huge memory consumption.
[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278575#comment-13278575 ] Hadoop QA commented on HDFS-2982: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12527992/HDFS-2982.001.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal: org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2468//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2468//console This message is automatically generated. > Startup performance suffers when there are many edit log segments > - > > Key: HDFS-2982 > URL: https://issues.apache.org/jira/browse/HDFS-2982 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Colin Patrick McCabe >Priority: Critical > Attachments: HDFS-2982.001.patch > > > For every one of the edit log segments, it seems like we are calling > listFiles on the edit log directory inside of {{findMaxTransaction}}. This is > killing performance, especially when there are many log segments and the > directory is stored on NFS. 
It is taking several minutes to start up the NN > when there are several thousand log segments present.
[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278574#comment-13278574 ] Hadoop QA commented on HDFS-3049: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12527991/HDFS-3049.021.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal: org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2467//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2467//console This message is automatically generated. 
> During the normal loading NN startup process, fall back on a different > EditLog if we see one that is corrupt > > > Key: HDFS-3049 > URL: https://issues.apache.org/jira/browse/HDFS-3049 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node >Affects Versions: 0.23.0 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe >Priority: Minor > Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, > HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, > HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, > HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, > HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, > HDFS-3049.018.patch, HDFS-3049.021.patch > > > During the NameNode startup process, we load an image, and then apply edit > logs to it until we believe that we have all the latest changes. > Unfortunately, if there is an I/O error while reading any of these files, in > most cases, we simply abort the startup process. We should try harder to > locate a readable edit log and/or image file. > *There are three main use cases for this feature:* > 1. If the operating system does not honor fsync (usually due to a > misconfiguration), a file may end up in an inconsistent state. > 2. In certain older releases where we did not use fallocate() or similar to > pre-reserve blocks, a disk full condition may cause a truncated log in one > edit directory. > 3. There may be a bug in HDFS which results in some of the data directories > receiving corrupt data, but not all. This is the least likely use case. > *Proposed changes to normal NN startup* > * We should try a different FSImage if we can't load the first one we try. > * We should examine other FSEditLogs if we can't load the first one(s) we try. > * We should fail if we can't find EditLogs that would bring us up to what we > believe is the latest transaction ID. 
> Proposed changes to recovery mode NN startup: > we should list out all the available storage directories and allow the > operator to select which one he wants to use. > Something like this: > {code} > Multiple storage directories found. > 1. /foo/bar > edits__curent__XYZ size:213421345 md5:2345345 > image size:213421345 md5:2345345 > 2. /foo/baz > edits__curent__XYZ size:213421345 md5:2345345345 > image size:213421345 md5:2345345 > Which one would you like to use? (1/2) > {code} > As usual in recovery mode, we want to be flexible about error handling. In > this case, this means that we should NOT fail if we can't find EditLogs that > would bring us up to what we believe is the latest transaction ID. > *Not addressed by this feature* > This feature will not address the case where an attempt to access the > NameNode name directory or directories hangs because of an I/O error. This > may happen, for example, when trying to load an image from a hard-mounted NFS > directory, when the NFS server has gone away. Just as now, the operator will > have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly
[jira] [Assigned] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins reassigned HDFS-2982: - Assignee: Colin Patrick McCabe (was: Todd Lipcon) > Startup performance suffers when there are many edit log segments > - > > Key: HDFS-2982 > URL: https://issues.apache.org/jira/browse/HDFS-2982 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Colin Patrick McCabe >Priority: Critical > Attachments: HDFS-2982.001.patch > > > For every one of the edit log segments, it seems like we are calling > listFiles on the edit log directory inside of {{findMaxTransaction}}. This is > killing performance, especially when there are many log segments and the > directory is stored on NFS. It is taking several minutes to start up the NN > when there are several thousand log segments present. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-2982: -- Target Version/s: 2.0.1 (was: HA branch (HDFS-1623), 0.24.0) Affects Version/s: (was: 0.24.0) 2.0.0 > Startup performance suffers when there are many edit log segments > - > > Key: HDFS-2982 > URL: https://issues.apache.org/jira/browse/HDFS-2982 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Attachments: HDFS-2982.001.patch > > > For every one of the edit log segments, it seems like we are calling > listFiles on the edit log directory inside of {{findMaxTransaction}}. This is > killing performance, especially when there are many log segments and the > directory is stored on NFS. It is taking several minutes to start up the NN > when there are several thousand log segments present. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-3440:
---
Attachment: HDFS-3440.002.patch

* add test
* address todd's comments

> should more effectively limit stream memory consumption when reading corrupt
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log.
> However, this could result in us using all of those 100 MB when reading bogus
> data, which is not what we want. It also could easily make some corrupt edit
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException
> when we're in this situation and does not result in huge memory consumption.
[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278547#comment-13278547 ] Colin Patrick McCabe commented on HDFS-3049: FYI: I'm posting the patch for the startup performance stuff over at HDFS-2982. > During the normal loading NN startup process, fall back on a different > EditLog if we see one that is corrupt > > > Key: HDFS-3049 > URL: https://issues.apache.org/jira/browse/HDFS-3049 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node >Affects Versions: 0.23.0 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe >Priority: Minor > Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, > HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, > HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, > HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, > HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, > HDFS-3049.018.patch, HDFS-3049.021.patch > > > During the NameNode startup process, we load an image, and then apply edit > logs to it until we believe that we have all the latest changes. > Unfortunately, if there is an I/O error while reading any of these files, in > most cases, we simply abort the startup process. We should try harder to > locate a readable edit log and/or image file. > *There are three main use cases for this feature:* > 1. If the operating system does not honor fsync (usually due to a > misconfiguration), a file may end up in an inconsistent state. > 2. In certain older releases where we did not use fallocate() or similar to > pre-reserve blocks, a disk full condition may cause a truncated log in one > edit directory. > 3. There may be a bug in HDFS which results in some of the data directories > receiving corrupt data, but not all. This is the least likely use case. > *Proposed changes to normal NN startup* > * We should try a different FSImage if we can't load the first one we try. 
> * We should examine other FSEditLogs if we can't load the first one(s) we try. > * We should fail if we can't find EditLogs that would bring us up to what we > believe is the latest transaction ID. > Proposed changes to recovery mode NN startup: > we should list out all the available storage directories and allow the > operator to select which one he wants to use. > Something like this: > {code} > Multiple storage directories found. > 1. /foo/bar > edits__curent__XYZ size:213421345 md5:2345345 > image size:213421345 md5:2345345 > 2. /foo/baz > edits__curent__XYZ size:213421345 md5:2345345345 > image size:213421345 md5:2345345 > Which one would you like to use? (1/2) > {code} > As usual in recovery mode, we want to be flexible about error handling. In > this case, this means that we should NOT fail if we can't find EditLogs that > would bring us up to what we believe is the latest transaction ID. > *Not addressed by this feature* > This feature will not address the case where an attempt to access the > NameNode name directory or directories hangs because of an I/O error. This > may happen, for example, when trying to load an image from a hard-mounted NFS > directory, when the NFS server has gone away. Just as now, the operator will > have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-2982: --- Attachment: HDFS-2982.001.patch > Startup performance suffers when there are many edit log segments > - > > Key: HDFS-2982 > URL: https://issues.apache.org/jira/browse/HDFS-2982 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.24.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Attachments: HDFS-2982.001.patch > > > For every one of the edit log segments, it seems like we are calling > listFiles on the edit log directory inside of {{findMaxTransaction}}. This is > killing performance, especially when there are many log segments and the > directory is stored on NFS. It is taking several minutes to start up the NN > when there are several thousand log segments present. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2982) Startup performance suffers when there are many edit log segments
[ https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-2982: --- Status: Patch Available (was: Open) > Startup performance suffers when there are many edit log segments > - > > Key: HDFS-2982 > URL: https://issues.apache.org/jira/browse/HDFS-2982 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 0.24.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Attachments: HDFS-2982.001.patch > > > For every one of the edit log segments, it seems like we are calling > listFiles on the edit log directory inside of {{findMaxTransaction}}. This is > killing performance, especially when there are many log segments and the > directory is stored on NFS. It is taking several minutes to start up the NN > when there are several thousand log segments present. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3049: --- Attachment: HDFS-3049.021.patch smaller patch which strips out RedundantEditLogStream, StreamLimiter Fixed some comments, addressed todd's comments. Many Log.info messages changed to debug, etc. > During the normal loading NN startup process, fall back on a different > EditLog if we see one that is corrupt > > > Key: HDFS-3049 > URL: https://issues.apache.org/jira/browse/HDFS-3049 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node >Affects Versions: 0.23.0 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe >Priority: Minor > Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, > HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, > HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, > HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, > HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, > HDFS-3049.018.patch, HDFS-3049.021.patch > > > During the NameNode startup process, we load an image, and then apply edit > logs to it until we believe that we have all the latest changes. > Unfortunately, if there is an I/O error while reading any of these files, in > most cases, we simply abort the startup process. We should try harder to > locate a readable edit log and/or image file. > *There are three main use cases for this feature:* > 1. If the operating system does not honor fsync (usually due to a > misconfiguration), a file may end up in an inconsistent state. > 2. In certain older releases where we did not use fallocate() or similar to > pre-reserve blocks, a disk full condition may cause a truncated log in one > edit directory. > 3. There may be a bug in HDFS which results in some of the data directories > receiving corrupt data, but not all. This is the least likely use case. 
> *Proposed changes to normal NN startup* > * We should try a different FSImage if we can't load the first one we try. > * We should examine other FSEditLogs if we can't load the first one(s) we try. > * We should fail if we can't find EditLogs that would bring us up to what we > believe is the latest transaction ID. > Proposed changes to recovery mode NN startup: > we should list out all the available storage directories and allow the > operator to select which one he wants to use. > Something like this: > {code} > Multiple storage directories found. > 1. /foo/bar > edits__curent__XYZ size:213421345 md5:2345345 > image size:213421345 md5:2345345 > 2. /foo/baz > edits__curent__XYZ size:213421345 md5:2345345345 > image size:213421345 md5:2345345 > Which one would you like to use? (1/2) > {code} > As usual in recovery mode, we want to be flexible about error handling. In > this case, this means that we should NOT fail if we can't find EditLogs that > would bring us up to what we believe is the latest transaction ID. > *Not addressed by this feature* > This feature will not address the case where an attempt to access the > NameNode name directory or directories hangs because of an I/O error. This > may happen, for example, when trying to load an image from a hard-mounted NFS > directory, when the NFS server has gone away. Just as now, the operator will > have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
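The fallback behaviour proposed for normal NameNode startup in HDFS-3049 (try another copy when one is unreadable, fail only when none works) can be sketched as below. This is a simplified illustration under assumptions: LogSource and loadFirstReadable are hypothetical names, and a real implementation would also have to verify that the loaded edits reach the expected latest transaction ID.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of falling back across redundant edit log copies. */
class EditLogFallbackSketch {
  /** Stand-in for one copy of an edit log; load() returns the last txid applied. */
  interface LogSource {
    long load() throws IOException;
  }

  /**
   * Tries each candidate in order and returns the result of the first one
   * that loads. Fails only if every candidate throws.
   */
  static long loadFirstReadable(List<LogSource> candidates) throws IOException {
    IOException lastFailure = null;
    for (LogSource candidate : candidates) {
      try {
        return candidate.load();
      } catch (IOException e) {
        // Remember the failure and move on to the next storage directory.
        lastFailure = e;
      }
    }
    throw new IOException(
        "No readable edit log among " + candidates.size() + " candidates", lastFailure);
  }
}
```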
[jira] [Commented] (HDFS-744) Support hsync in HDFS
[ https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278544#comment-13278544 ]

Lars Hofhansl commented on HDFS-744:

Alternatively we could add flushFS() to SequenceFile.Writer.

> Support hsync in HDFS
> -
>
> Key: HDFS-744
> URL: https://issues.apache.org/jira/browse/HDFS-744
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node, hdfs client
> Reporter: Hairong Kuang
> Assignee: Lars Hofhansl
> Attachments: HDFS-744-trunk-v2.patch, HDFS-744-trunk-v3.patch,
> HDFS-744-trunk-v4.patch, HDFS-744-trunk.patch, hdfs-744-v2.txt,
> hdfs-744-v3.txt, hdfs-744.txt
>
> HDFS-731 implements hsync by default as hflush. As described in HADOOP-6313,
> the real expected semantics should be "flushes out to all replicas and all
> replicas have done posix fsync equivalent - ie the OS has flushed it to the
> disk device (but the disk may have it in its cache)." This jira aims to
> implement the expected behaviour.
[jira] [Commented] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)
[ https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278505#comment-13278505 ]

Hadoop QA commented on HDFS-3415:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12527967/HDFS-3415.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 javadoc. The javadoc tool appears to have generated 2 warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2466//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2466//console

This message is automatically generated.
> NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)
>
> Key: HDFS-3415
> URL: https://issues.apache.org/jira/browse/HDFS-3415
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 2.0.0, 3.0.0
> Environment: Suse linux + jdk 1.6
> Reporter: Brahma Reddy Battula
> Assignee: Brandon Li
> Attachments: HDFS-3415.patch
>
> Scenario:
> Start the Namenode and datanode with three storage dirs configured for the namenode.
> Write 10 files.
> Edit the version file of one of the storage dirs and set the layout version to 123, which differs from the default (-40).
> Stop the namenode.
> Start the Namenode.
> Then I get the following exception...
> {noformat}
> 2012-05-13 19:01:41,483 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
> java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686)
>   at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:243)
>   at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:261)
>   at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:276)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:368)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:402)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:564)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:545)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1093)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1151)
> 2012-05-13 19:01:41,485 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> {noformat}
[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278500#comment-13278500 ]

Hadoop QA commented on HDFS-3440:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12527968/HDFS-3440.001.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 javadoc. The javadoc tool appears to have generated 2 warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2465//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2465//console

This message is automatically generated.

> should more effectively limit stream memory consumption when reading corrupt edit logs
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Attachments: HDFS-3440.001.patch
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. However, this could result in us using all of those 100 MB when reading bogus data, which is not what we want. It also could easily make some corrupt edit log files unreadable.
> We should have a stream limiter interface that causes a clean IOException when we're in this situation and does not result in huge memory consumption.
[jira] [Commented] (HDFS-744) Support hsync in HDFS
[ https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278488#comment-13278488 ]

Lars Hofhansl commented on HDFS-744:

Thanks Nicholas! I'll rename the flag and the flush method.
The reason for not calling flush(true) was two-fold:
# Code that currently uses hsync would suddenly get the new behavior. For example HBase which uses this via a SequenceFile.Writer would have no option to disable this (unless we expose a new flag to Writer.syncFS).
# Without SYNC_BLOCK it kinda makes no sense (or would at least set the wrong expectation that everything is sync'ed to disk).

Sorry about the tabs, I had used my default eclipse formatter.
Will look at the Append tests and some new ones for hsync.
[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278472#comment-13278472 ]

Todd Lipcon commented on HDFS-3440:

- StreamLimiter either needs to be package-private or marked with a private interface annotation
- we don't generally mark interface methods as "abstract". In fact I didn't know that was legal java
- can you refactor out the code that checks curPos+len against the limit into a {{checkLimit(int bytesToRead);}} call?
- would be good to add a simple unit test of this functionality - eg construct a FSEditLogOp.Reader and give it a header which would cause it to try to read more than MAX_OP_SIZE, verify it throws the expected exception.
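For readers following the review, here is a minimal sketch of what a stream limiter with a factored-out checkLimit call could look like. It is not the code in HDFS-3440.001.patch: the StreamLimiter method names (setLimit, clearLimit), the LimitedInputStream class and its fields are assumptions made for illustration, and mark/reset bookkeeping is deliberately left out.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical interface; the name comes from the review, the methods are assumed. */
interface StreamLimiter {
  void setLimit(long limit);
  void clearLimit();
}

/**
 * Illustrative wrapper: once setLimit(n) is called, reading more than n further
 * bytes fails with an IOException instead of buffering unbounded data.
 */
class LimitedInputStream extends FilterInputStream implements StreamLimiter {
  private long curPos = 0;                    // bytes read so far
  private long limitPos = Long.MAX_VALUE;     // absolute position we may not pass

  LimitedInputStream(InputStream in) {
    super(in);
  }

  @Override
  public void setLimit(long limit) {
    limitPos = curPos + limit;
  }

  @Override
  public void clearLimit() {
    limitPos = Long.MAX_VALUE;
  }

  /** Single place that compares curPos + bytesToRead against the limit. */
  private void checkLimit(long bytesToRead) throws IOException {
    if (curPos + bytesToRead > limitPos) {
      throw new IOException("Tried to read " + bytesToRead
          + " byte(s) past the limit at offset " + limitPos);
    }
  }

  @Override
  public int read() throws IOException {
    checkLimit(1);
    int b = super.read();
    if (b != -1) {
      curPos++;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    checkLimit(len);
    int n = super.read(buf, off, len);
    if (n > 0) {
      curPos += n;
    }
    return n;
  }
}
```

A caller would set the limit before decoding one op and clear it afterwards, so a corrupt length field produces a clean IOException rather than a very large read.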
[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278460#comment-13278460 ]

Colin Patrick McCabe commented on HDFS-3440:

(the "failure to apply patch" thing is related to the earlier patch I posted and then took down)
[jira] [Updated] (HDFS-3330) If GetImageServlet throws an Error or RTE, response has HTTP "OK" status
[ https://issues.apache.org/jira/browse/HDFS-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HDFS-3330:

Resolution: Fixed
Fix Version/s: 1.1.0
Target Version/s: (was: 1.1.0, 2.0.0)
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

Todd, nope. I committed to branch-1. Thanks!

> If GetImageServlet throws an Error or RTE, response has HTTP "OK" status
>
> Key: HDFS-3330
> URL: https://issues.apache.org/jira/browse/HDFS-3330
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 1.0.0, 2.0.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: 1.1.0, 2.0.0
> Attachments: hdfs-3330.txt
>
> Currently in GetImageServlet, we catch Exception but not other Errors or RTEs. So, if the code ends up throwing one of these exceptions, the "response.sendError()" code doesn't run, but the finally clause does run. This results in the servlet returning HTTP 200 OK and an empty response, which causes the client to think it got a successful image transfer.
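The failure mode described in the issue can be reproduced in miniature without a servlet container. The sketch below is not GetImageServlet and does not use the servlet API: FakeResponse and ImageHandler are hypothetical stand-ins, and the fix shown (catching Throwable) is only one plausible shape of a remedy.

```java
/** Hypothetical stand-in for an HTTP response; not the servlet API. */
class FakeResponse {
  int status = 200;          // default status, as with a fresh response
  boolean committed = false;

  void sendError(int code) {
    if (!committed) {
      status = code;
      committed = true;
    }
  }

  /** Closing commits whatever status is currently set. */
  void close() {
    committed = true;
  }
}

class ImageHandler {
  /** Problem shape: an Error bypasses sendError() but the finally block still runs. */
  static void serveCatchingException(Runnable work, FakeResponse response) {
    try {
      work.run();
    } catch (Exception e) {
      response.sendError(500);
    } finally {
      response.close();      // commits the response with status 200 when an Error escapes
    }
  }

  /** Possible remedy: report every Throwable before the response is closed. */
  static void serveCatchingThrowable(Runnable work, FakeResponse response) {
    try {
      work.run();
    } catch (Throwable t) {
      response.sendError(500);
    } finally {
      response.close();
    }
  }
}
```

With the first variant a simulated Error leaves the status at 200, which matches the "successful" empty transfer the client sees; with the second the status becomes 500.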
[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278457#comment-13278457 ]

Hadoop QA commented on HDFS-3440:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12527964/number1.001.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
-1 javac. The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2464//console

This message is automatically generated.
[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-3440:

Attachment: (was: number1.001.patch)
[jira] [Updated] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)
[ https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Li updated HDFS-3415:

Attachment: HDFS-3415.patch

Instead of allowing the namenode to move forward with multiple layout versions, the storage inspector selector (NNStorage.readAndInspectDirs) should throw an exception reporting the inconsistent layout versions.
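The fail-fast idea in the update above can be sketched as follows. This is not the attached HDFS-3415.patch: LayoutVersionCheck, requireConsistentLayoutVersion and the directory-to-version map are hypothetical names chosen for illustration, and real storage inspection does considerably more than compare integers.

```java
import java.io.IOException;
import java.util.Map;

class LayoutVersionCheck {
  /**
   * Hypothetical helper: returns the layout version shared by all storage
   * directories, or throws if any directory disagrees with the first one.
   */
  static int requireConsistentLayoutVersion(Map<String, Integer> versionByDir)
      throws IOException {
    if (versionByDir.isEmpty()) {
      throw new IOException("No storage directories to inspect");
    }
    Map.Entry<String, Integer> first = versionByDir.entrySet().iterator().next();
    for (Map.Entry<String, Integer> dir : versionByDir.entrySet()) {
      if (!dir.getValue().equals(first.getValue())) {
        throw new IOException("Inconsistent layout versions: "
            + first.getKey() + " has " + first.getValue() + " but "
            + dir.getKey() + " has " + dir.getValue());
      }
    }
    return first.getValue();
  }
}
```

Rejecting the mismatch up front gives the operator a message naming the offending directories instead of a NullPointerException later in startup.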
[jira] [Updated] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)
[ https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Li updated HDFS-3415:

Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-3440:

Attachment: HDFS-3440.001.patch

some unrelated stuff got mixed into the last patch
[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-3440:

Status: Patch Available (was: Open)
[jira] [Created] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
Colin Patrick McCabe created HDFS-3440:

Summary: should more effectively limit stream memory consumption when reading corrupt edit logs
Key: HDFS-3440
URL: https://issues.apache.org/jira/browse/HDFS-3440
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
Attachments: number1.001.patch

Currently, we do in.mark(100MB) before reading an opcode out of the edit log. However, this could result in us using all of those 100 MB when reading bogus data, which is not what we want. It also could easily make some corrupt edit log files unreadable.
We should have a stream limiter interface that causes a clean IOException when we're in this situation and does not result in huge memory consumption.
[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs
[ https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-3440:

Attachment: number1.001.patch
[jira] [Commented] (HDFS-744) Support hsync in HDFS
[ https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278444#comment-13278444 ]

Tsz Wo (Nicholas), SZE commented on HDFS-744:

Thanks a lot, Lars! The patch looks good. Some comments:

- Let's name the new CreateFlag SYNC_BLOCK instead of FORCE. POSIX uses SYNC as you mentioned, but POSIX SYNC means syncing every write.
- DFSOutputStream.hsync():
-* It should call flush(true). It is better to sync the current block than not to sync at all.
-* Need to update the javadoc to say that it only syncs the current block.
- Rename flush(force) to flushOrSync(isSync) in BlockReceiver and DFSOutputStream. Please also update the javadoc.
- We do not use tabs in Hadoop. Indentation should use two spaces.
- Please add some new tests. It is not easy to test whether sync actually works, but at least add a new test that calls hsync(). See TestFileAppend and TestFileAppend[234] to get some ideas.
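To make the flush versus sync distinction in the review concrete, here is a local-file analogy in plain java.io. It is only a sketch: LocalBlockWriter is a hypothetical class, it writes a single local file rather than a replicated HDFS block, and the flushOrSync(isSync) name is borrowed from the rename suggested above, not taken from the patch.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/** Hypothetical local-file writer illustrating flush-only versus flush-and-sync. */
class LocalBlockWriter implements AutoCloseable {
  private final FileOutputStream file;
  private final BufferedOutputStream out;

  LocalBlockWriter(File target) throws IOException {
    this.file = new FileOutputStream(target);
    this.out = new BufferedOutputStream(file);
  }

  void write(byte[] data) throws IOException {
    out.write(data);
  }

  /**
   * isSync == false: push buffered bytes to the OS (roughly the hflush idea).
   * isSync == true: additionally ask the OS to write them to the device
   * (roughly the hsync idea; the disk may still hold them in its own cache).
   */
  void flushOrSync(boolean isSync) throws IOException {
    out.flush();
    if (isSync) {
      file.getFD().sync();
    }
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
```

As the review notes, it is hard to test that a sync really reached the disk; a test can at least confirm that the call succeeds and the data is visible afterwards.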
[jira] [Commented] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)
[ https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278409#comment-13278409 ] Brandon Li commented on HDFS-3415: -- I reproduce this problem with the following configuration: 1. set two storage directories, say dirA and dirB 2. start and then shutdown namenode 3. change only dirB's layout version from -40 to 123. 4. start namenode and it should fail with the above NullPointerException The problem here is: Two storage inspectors are used in namenode, FSImagePreTransactionalStorageInspector for layout version before -38, and FSImageTransactionalStorageInspector for -38 or anything later. In this case, the modified storage directory happens to be the last one inspected by the namenode in order to load image/edits. Even though it sees two layout version, it saves the last one ("123" in this case) as the storage layout version. However, it uses FSImageTransactionalStorageInspector to get image path because dirA still has -40 and then uses FSImagePreTransactionalStorageInspector to get edit stream. Because FSImagePreTransactionalStorageInspector can't recognize the file in a storage directory whose real version is newer, some references are not initialized which eventually cause the exception. > NameNode is getting shutdown by throwing nullpointer exception when one of > the layout version is different with others(Multiple storage dirs are > configured) > > > Key: HDFS-3415 > URL: https://issues.apache.org/jira/browse/HDFS-3415 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0, 3.0.0 > Environment: Suse linux + jdk 1.6 >Reporter: Brahma Reddy Battula >Assignee: Brandon Li > > Scenario: > = > start Namenode and datanode by configuring three storage dir's for namenode > write 10 files > edit version file of one of the storage dir and give layout version as 123 > which different with default(-40). > Stop namenode > start Namenode. 
> Then I am getting the following exception... > {noformat} > 2012-05-13 19:01:41,483 ERROR > org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686) > at > org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:243) > at > org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:261) > at > org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:276) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:368) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:402) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:564) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:545) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1093) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1151) > 2012-05-13 19:01:41,485 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: > SHUTDOWN_MSG: > {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
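The inspector mismatch described in the HDFS-3415 comment above can be illustrated with a small sketch. This is not the NameNode's code: the class LayoutVersionCheck and both of its methods are hypothetical names, and the only facts taken from the comment are the -38 boundary and the -40 / 123 example values. It assumes that a "later" layout version is a numerically smaller one, which is consistent with -40 selecting the transactional inspector and 123 selecting the pre-transactional one. The idea is to refuse to continue when the storage directories disagree, rather than letting the last directory inspected decide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of HDFS: fail fast on mixed layout versions.
final class LayoutVersionCheck {
  // From the comment above: -38 or anything later is "transactional".
  static final int FIRST_TRANSACTIONAL_LAYOUT = -38;

  // Assumption: later layout versions are numerically smaller.
  static boolean isTransactional(int layoutVersion) {
    return layoutVersion <= FIRST_TRANSACTIONAL_LAYOUT;
  }

  // Returns the single layout version shared by every directory,
  // or throws with a readable message if they disagree (or none exist).
  static int requireConsistent(Map<String, Integer> versionByDir) {
    TreeSet<Integer> distinct = new TreeSet<>(versionByDir.values());
    if (distinct.size() != 1) {
      throw new IllegalStateException(
          "Storage directories disagree on layout version: " + versionByDir);
    }
    return distinct.first();
  }

  public static void main(String[] args) {
    Map<String, Integer> dirs = new LinkedHashMap<>();
    dirs.put("dirA", -40);
    dirs.put("dirB", 123);
    try {
      requireConsistent(dirs);
    } catch (IllegalStateException e) {
      // A clear error instead of a NullPointerException further down.
      System.out.println(e.getMessage());
    }
  }
}
```

With a check like this, the choice of inspector would be made once from a single agreed version, so the image path and the edit streams could not come from two different inspectors.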
[jira] [Updated] (HDFS-2276) src/test/unit tests not being run in mavenized HDFS
[ https://issues.apache.org/jira/browse/HDFS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-2276: -- Target Version/s: 2.0.1 Affects Version/s: (was: 0.23.0) 2.0.0 > src/test/unit tests not being run in mavenized HDFS > --- > > Key: HDFS-2276 > URL: https://issues.apache.org/jira/browse/HDFS-2276 > Project: Hadoop HDFS > Issue Type: Bug > Components: build, test >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > Attachments: hdfs-2276.txt > > > There are about 5 tests in src/test/unit that are no longer being run.
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278397#comment-13278397 ] Hudson commented on HDFS-3391: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2282 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2282/]) HDFS-3391. Fix InvalidateBlocks to compare blocks including their generation stamps. Contributed by Todd Lipcon. (Revision 1339897) Result = FAILURE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339897 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/InvalidateBlocks.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightHashSet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightLinkedSet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestLightWeightHashSet.java > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Fix For: 2.0.1 > > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, 
Skipped: 0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > --
[jira] [Commented] (HDFS-2956) calling fetchdt without a --renewer argument throws NPE
[ https://issues.apache.org/jira/browse/HDFS-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278396#comment-13278396 ] Aaron T. Myers commented on HDFS-2956: -- Hey Daryn, any update here? I just bumped into this myself. > calling fetchdt without a --renewer argument throws NPE > --- > > Key: HDFS-2956 > URL: https://issues.apache.org/jira/browse/HDFS-2956 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 0.24.0 >Reporter: Todd Lipcon > > If I call "bin/hdfs fetchdt /tmp/mytoken" without a "--renewer foo" argument, > then it will throw a NullPointerException: > Exception in thread "main" java.lang.NullPointerException > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:830) > this is because getDelegationToken is being called with a null renewer
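As a rough illustration of the HDFS-2956 report above, the sketch below shows one way a missing "--renewer" argument could be turned into a non-null value before it reaches getDelegationToken. Everything here is hypothetical: FetchdtRenewerArg, parseRenewer and renewerOrEmpty are made-up names, this is not how fetchdt actually parses its arguments, and whether an empty renewer is acceptable to the NameNode is an assumption rather than something the report states.

```java
// Hypothetical sketch, not the real fetchdt tool.
final class FetchdtRenewerArg {
  // Returns the value that follows "--renewer", or null when the flag is absent.
  static String parseRenewer(String[] args) {
    for (int i = 0; i < args.length - 1; i++) {
      if ("--renewer".equals(args[i])) {
        return args[i + 1];
      }
    }
    return null;
  }

  // Assumed guard: substitute an empty string so that callers
  // further down never receive a null renewer.
  static String renewerOrEmpty(String renewer) {
    return renewer == null ? "" : renewer;
  }

  public static void main(String[] args) {
    // Mirrors the reported invocation: a token path but no --renewer flag.
    String renewer = renewerOrEmpty(parseRenewer(new String[] {"/tmp/mytoken"}));
    System.out.println("renewer='" + renewer + "'");
  }
}
```

An alternative with the same effect on the NPE would be to reject the call with a usage message when the flag is missing; which behavior is preferable is a decision for the issue itself.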
[jira] [Created] (HDFS-3439) Balancer exits if fs.defaultFS is set to a different, but semantically identical, URI from dfs.namenode.rpc-address
Aaron T. Myers created HDFS-3439: Summary: Balancer exits if fs.defaultFS is set to a different, but semantically identical, URI from dfs.namenode.rpc-address Key: HDFS-3439 URL: https://issues.apache.org/jira/browse/HDFS-3439 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.0.0 Reporter: Aaron T. Myers The balancer determines the set of NN URIs to balance by looking at fs.defaultFS and all possible dfs.namenode.(service)rpc-address settings. If fs.defaultFS is, for example, set to "hdfs://foo.example.com:8020/" (note the trailing "/") and the rpc-address is set to "hdfs://foo.example.com:8020" (without a "/"), then the balancer will conclude that there are two NNs and try to balance both. However, since both of these URIs refer to the same actual FS instance, the balancer will exit with "java.io.IOException: Another balancer is running. Exiting ..."
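To make the HDFS-3439 report above concrete: java.net.URI treats "hdfs://foo.example.com:8020/" and "hdfs://foo.example.com:8020" as unequal, because one has the path "/" and the other has an empty path. The sketch below shows one possible normalization that collapses the two before they are counted as separate NameNodes. It is only a sketch, not the balancer's code; NameNodeUriDedup, normalize and dedup are hypothetical names, and it assumes a NameNode URI is fully identified by its scheme and authority.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not the balancer's implementation.
final class NameNodeUriDedup {
  // Drops an empty or "/" path so only scheme://authority remains.
  static URI normalize(URI uri) {
    String path = uri.getPath();
    if (path == null || path.isEmpty() || path.equals("/")) {
      return URI.create(uri.getScheme() + "://" + uri.getAuthority());
    }
    return uri;
  }

  // Collapses URIs that are identical after normalization, keeping order.
  static Set<URI> dedup(Collection<URI> uris) {
    Set<URI> out = new LinkedHashSet<>();
    for (URI u : uris) {
      out.add(normalize(u));
    }
    return out;
  }

  public static void main(String[] args) {
    URI defaultFs = URI.create("hdfs://foo.example.com:8020/");
    URI rpcAddress = URI.create("hdfs://foo.example.com:8020");
    System.out.println("raw equals: " + defaultFs.equals(rpcAddress));
    System.out.println("after dedup: " + dedup(Arrays.asList(defaultFs, rpcAddress)));
  }
}
```

This only handles the trailing-slash case from the report. Other semantically identical forms, such as a hostname versus its IP address, would need more than string-level normalization.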
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278361#comment-13278361 ] Hudson commented on HDFS-3391: -- Integrated in Hadoop-Common-trunk-Commit #2264 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2264/]) HDFS-3391. Fix InvalidateBlocks to compare blocks including their generation stamps. Contributed by Todd Lipcon. (Revision 1339897) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339897 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/InvalidateBlocks.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightHashSet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightLinkedSet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestLightWeightHashSet.java > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Fix For: 2.0.1 > > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 
0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > --
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278359#comment-13278359 ] Hudson commented on HDFS-3391: -- Integrated in Hadoop-Hdfs-trunk-Commit #2337 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2337/]) HDFS-3391. Fix InvalidateBlocks to compare blocks including their generation stamps. Contributed by Todd Lipcon. (Revision 1339897) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339897 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/InvalidateBlocks.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightHashSet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightLinkedSet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestLightWeightHashSet.java > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Fix For: 2.0.1 > > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, 
Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > --
[jira] [Updated] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-3391: -- Resolution: Fixed Fix Version/s: 2.0.1 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Fix For: 2.0.1 > > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > -- -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278346#comment-13278346 ] Todd Lipcon commented on HDFS-3391: --- Thanks Nicholas. The two javadoc warnings above are due to gridmix, so unrelated to this patch. I'll commit this momentarily. > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > -- -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278341#comment-13278341 ] Colin Patrick McCabe commented on HDFS-3049: bq. Also, I'm not sure why the exception is swallowed instead of rethrown. If it fails to open the edit log, shouldn't that generate an exception on init()? Should we make init() a public interface (eg "open()") instead, so that the caller is cognizant of the flow here, instead of doing it lazily? I think that would also simplify the other functions, which could just do a Preconditions.checkState(state == State.OPEN) instead of always handling lazy-init. You're right-- the exception should be thrown if resync == false. As for creating a public init() method-- I guess, but this patch is getting kind of big already. Perhaps we could file a separate JIRA for that? I also have some other API ideas that might improve efficiency (not going to discuss them here due to space constraints) bq. Why do you close() here but not close() in the normal case where you reach the end of the log? It seems it should be up to the caller to close upon hitting the "eof" (null txn) either way. The rationale behind this was discussed in HDFS-3335. Basically, if there is corruption at the end of the log, but we read everything we were supposed to, we don't want to throw an exception. As for closing in the eof case, that seems unnecessary. The caller has to call close() anyway; that's part of the contract for this API. So we don't really add any value by doing it automatically. bq. again, why not just catch Throwable? Yeah, we should do that. Will fix. bq. IllegalStateException would be more appropriate here ok [streamlimiter comments] agree with most/all of this. I think this can be separated out (probably) [log message comments] yes, probably some of those should be debug-level messages. Probably at least the ones which just describe "situation normal, added new stream" etc. 
[separate into 3 patches] well, it's worth a try. There are some non-obvious dependencies, but I'll give it a try. > During the normal loading NN startup process, fall back on a different > EditLog if we see one that is corrupt > > > Key: HDFS-3049 > URL: https://issues.apache.org/jira/browse/HDFS-3049 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node >Affects Versions: 0.23.0 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe >Priority: Minor > Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, > HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, > HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, > HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, > HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, > HDFS-3049.018.patch > > > During the NameNode startup process, we load an image, and then apply edit > logs to it until we believe that we have all the latest changes. > Unfortunately, if there is an I/O error while reading any of these files, in > most cases, we simply abort the startup process. We should try harder to > locate a readable edit log and/or image file. > *There are three main use cases for this feature:* > 1. If the operating system does not honor fsync (usually due to a > misconfiguration), a file may end up in an inconsistent state. > 2. In certain older releases where we did not use fallocate() or similar to > pre-reserve blocks, a disk full condition may cause a truncated log in one > edit directory. > 3. There may be a bug in HDFS which results in some of the data directories > receiving corrupt data, but not all. This is the least likely use case. > *Proposed changes to normal NN startup* > * We should try a different FSImage if we can't load the first one we try. > * We should examine other FSEditLogs if we can't load the first one(s) we try. > * We should fail if we can't find EditLogs that would bring us up to what we > believe is the latest transaction ID. 
> Proposed changes to recovery mode NN startup: > we should list out all the available storage directories and allow the > operator to select which one he wants to use. > Something like this: > {code} > Multiple storage directories found. > 1. /foo/bar > edits__curent__XYZ size:213421345 md5:2345345 > image size:213421345 md5:2345345 > 2. /foo/baz > edits__curent__XYZ size:213421345 md5:2345345345 > image size:213421345 md5:2345345 > Which one would you like to use? (1/2) > {code} > As usual in recovery mode, we want to be flexible about
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278334#comment-13278334 ] Hadoop QA commented on HDFS-3391: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12527912/hdfs-3391.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2463//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2463//console This message is automatically generated. > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > -- -- This message is automatically generated by JIRA. 
[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278317#comment-13278317 ] Todd Lipcon commented on HDFS-3049: --- {code} + } catch (RuntimeException e) { +LOG.error("caught exception initializing " + this, e); +state = State.CLOSED; +return null; + } catch (IOException e) { +LOG.error("caught exception initializing " + this, e); +state = State.CLOSED; +return null; + } {code} Why not simply catch Throwable here? Also, I'm not sure why the exception is swallowed instead of rethrown. If it fails to open the edit log, shouldn't that generate an exception on init()? Should we make init() a public interface (eg "open()") instead, so that the caller is cognizant of the flow here, instead of doing it lazily? I think that would also simplify the other functions, which could just do a Preconditions.checkState(state == State.OPEN) instead of always handling lazy-init. {code} + LOG.info("skipping the rest of " + this + " since we " + + "reached txId " + txId); + close(); {code} Why do you close() here but not close() in the normal case where you reach the end of the log? It seems it should be up to the caller to close upon hitting the "eof" (null txn) either way. {code} try { - return reader.readOp(true); + return nextOpImpl(true); } catch (IOException e) { + LOG.error("nextValidOp: got exception while reading " + this, e); + return null; +} catch (RuntimeException e) { + LOG.error("nextValidOp: got exception while reading " + this, e); return null; } {code} again, why not just catch Throwable? 
{code} +if (!streams.isEmpty()) { + String error = String.format("Cannot start writing at txid %s " + +"when there is a stream available for read: %s", +segmentTxId, streams.get(0)); + IOUtils.cleanup(LOG, streams.toArray(new EditLogInputStream[0])); + throw new RuntimeException(error); } {code} IllegalStateException would be more appropriate here Changes to PositionTrackingInputStream: can you refactor out a function like {{checkLimit(int amountToRead);}} here? Lots of duplicate code. Why is the opcode size changed from 100MB to 1.5MB? Didn't you just change it to 100MB recently? Also, why is this change to the limiting behavior lumped in here? It's hard to review when the patch has a lot of distinct changes put together. StreamLimiter needs an interface annotation, or be made package private. - There are a lot of new LOG.info messages which look more like they should be debug level. I don't think the operator would be able to make sense of all this output. How hard would it be to separate this into three patches? 1) Bug fix which uses the new StreamLimiter to fix the issue you mentioned higher up in the comments (and seems distinct from the rest) 2) Change the API to get rid of getInputStream() and fix the O(n^2) behavior 3) Introduce RedundantInputStream to solve the issue described in this JIRA I think there really are three separate things going on here and the 120KB patch is difficult to digest. 
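The checkLimit(int amountToRead) refactoring suggested in the review above could look roughly like the following. This is a sketch only: LimitedInputStream is a hypothetical stand-in, not the PositionTrackingInputStream from the patch, and its fields and error message are invented for illustration. The point is just that both read() overloads funnel through one limit check instead of duplicating it.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a position-tracking, limit-enforcing stream.
final class LimitedInputStream extends FilterInputStream {
  private long curPos = 0;
  private long limitPos = Long.MAX_VALUE;

  LimitedInputStream(InputStream in) {
    super(in);
  }

  // Allow at most 'limit' more bytes to be requested from the current position.
  void setLimit(long limit) {
    limitPos = curPos + limit;
  }

  // The single place where the limit is enforced.
  private void checkLimit(long amountToRead) throws IOException {
    long extra = (curPos + amountToRead) - limitPos;
    if (extra > 0) {
      throw new IOException("Tried to read " + amountToRead
          + " byte(s) past the limit at offset " + limitPos);
    }
  }

  @Override
  public int read() throws IOException {
    checkLimit(1);
    int b = super.read();
    if (b != -1) {
      curPos++;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    checkLimit(len);
    int n = super.read(buf, off, len);
    if (n > 0) {
      curPos += n;
    }
    return n;
  }

  public static void main(String[] args) throws IOException {
    LimitedInputStream in =
        new LimitedInputStream(new ByteArrayInputStream(new byte[] {1, 2, 3, 4}));
    in.setLimit(2);
    System.out.println(in.read());
    System.out.println(in.read());
    try {
      in.read();
    } catch (IOException e) {
      System.out.println("limit hit: " + e.getMessage());
    }
  }
}
```

The sketch checks the requested length rather than the bytes actually returned, so a large buffer read near the limit is rejected outright. It also leaves skip() and mark()/reset() untouched; a real implementation would need to decide how those interact with the limit.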
> During the normal loading NN startup process, fall back on a different > EditLog if we see one that is corrupt > > > Key: HDFS-3049 > URL: https://issues.apache.org/jira/browse/HDFS-3049 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node >Affects Versions: 0.23.0 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe >Priority: Minor > Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, > HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, > HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, > HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, > HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, > HDFS-3049.018.patch > > > During the NameNode startup process, we load an image, and then apply edit > logs to it until we believe that we have all the latest changes. > Unfortunately, if there is an I/O error while reading any of these files, in > most cases, we simply abort the startup process. We should try harder to > locate a readable edit log and/or image file. > *There are three main use cases for this feature:* > 1. If the operating system does not honor fsync (usually due to a > misconfiguration), a file may end up in an inconsistent state. > 2. In certain older releases where we did not use fallocate() or similar to > pre-reserve blocks, a disk full condition may cause a truncated log in one > edit directory. > 3. There may be a bug
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278298#comment-13278298 ] Tsz Wo (Nicholas), SZE commented on HDFS-3391: -- +1 patch looks good. Thanks. > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > -- -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278297#comment-13278297 ] Colin Patrick McCabe commented on HDFS-3049: (note: the javadoc warnings relate to gridmix and were not introduced by this change) > During the normal loading NN startup process, fall back on a different > EditLog if we see one that is corrupt > > > Key: HDFS-3049 > URL: https://issues.apache.org/jira/browse/HDFS-3049 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node >Affects Versions: 0.23.0 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe >Priority: Minor > Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, > HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, > HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, > HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, > HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, > HDFS-3049.018.patch > > > During the NameNode startup process, we load an image, and then apply edit > logs to it until we believe that we have all the latest changes. > Unfortunately, if there is an I/O error while reading any of these files, in > most cases, we simply abort the startup process. We should try harder to > locate a readable edit log and/or image file. > *There are three main use cases for this feature:* > 1. If the operating system does not honor fsync (usually due to a > misconfiguration), a file may end up in an inconsistent state. > 2. In certain older releases where we did not use fallocate() or similar to > pre-reserve blocks, a disk full condition may cause a truncated log in one > edit directory. > 3. There may be a bug in HDFS which results in some of the data directories > receiving corrupt data, but not all. This is the least likely use case. > *Proposed changes to normal NN startup* > * We should try a different FSImage if we can't load the first one we try. 
> * We should examine other FSEditLogs if we can't load the first one(s) we try. > * We should fail if we can't find EditLogs that would bring us up to what we > believe is the latest transaction ID. > Proposed changes to recovery mode NN startup: > we should list out all the available storage directories and allow the > operator to select which one he wants to use. > Something like this: > {code} > Multiple storage directories found. > 1. /foo/bar > edits__curent__XYZ size:213421345 md5:2345345 > image size:213421345 md5:2345345 > 2. /foo/baz > edits__curent__XYZ size:213421345 md5:2345345345 > image size:213421345 md5:2345345 > Which one would you like to use? (1/2) > {code} > As usual in recovery mode, we want to be flexible about error handling. In > this case, this means that we should NOT fail if we can't find EditLogs that > would bring us up to what we believe is the latest transaction ID. > *Not addressed by this feature* > This feature will not address the case where an attempt to access the > NameNode name directory or directories hangs because of an I/O error. This > may happen, for example, when trying to load an image from a hard-mounted NFS > directory, when the NFS server has gone away. Just as now, the operator will > have to notice this problem and take steps to correct it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
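The fallback behavior proposed in the HDFS-3049 description above can be sketched as a loop over candidate edit logs. This is an illustration under stated assumptions, not the patch: EditLogFallback, EditLogSource and lastReadableTxId are hypothetical names, and each candidate is assumed to expose a side-effect-free readThrough() that returns the last transaction ID it could read or throws IOException when it is unreadable.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "try the next edit log if this one is bad".
final class EditLogFallback {
  // Assumed abstraction over one copy of the edit log.
  interface EditLogSource {
    long readThrough() throws IOException;
  }

  // Returns the last txid reached. In normal mode, fails unless some
  // candidate reaches expectedLastTxId; in recovery mode, settles for
  // the best readable candidate instead of failing.
  static long lastReadableTxId(List<EditLogSource> candidates,
                               long expectedLastTxId,
                               boolean recoveryMode) throws IOException {
    IOException lastError = null;
    long best = -1;
    for (EditLogSource candidate : candidates) {
      try {
        long last = candidate.readThrough();
        if (last >= expectedLastTxId) {
          return last;
        }
        best = Math.max(best, last);
      } catch (IOException e) {
        // Remember the failure and fall back to the next candidate.
        lastError = e;
      }
    }
    if (recoveryMode && best >= 0) {
      return best;
    }
    throw new IOException("No edit log reaches txid " + expectedLastTxId, lastError);
  }

  public static void main(String[] args) throws IOException {
    EditLogSource corrupt = () -> { throw new IOException("corrupt log"); };
    EditLogSource truncated = () -> 90L;
    EditLogSource complete = () -> 100L;
    System.out.println(
        lastReadableTxId(Arrays.asList(corrupt, truncated, complete), 100L, false));
  }
}
```

The recoveryMode flag mirrors the distinction drawn in the description: normal startup should fail when no log reaches the expected transaction ID, while recovery mode is meant to be more tolerant.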
[jira] [Assigned] (HDFS-3373) FileContext HDFS implementation can leak socket caches
[ https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John George reassigned HDFS-3373: - Assignee: John George > FileContext HDFS implementation can leak socket caches > -- > > Key: HDFS-3373 > URL: https://issues.apache.org/jira/browse/HDFS-3373 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs client >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: John George > > As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, > and thus never calls DFSClient.close(). This means that, until finalizers > run, DFSClient will hold on to its SocketCache object and potentially have a > lot of outstanding sockets/fds held on to. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-3391: -- Attachment: hdfs-3391.txt Attached patch addresses Nicholas's comments above. The one I did not address was to remove the TODO that references HDFS-2668. Since that TODO is not addressed by this JIRA, I think it's better to address it in HDFS-2668 itself. > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > -- -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278212#comment-13278212 ] Tsz Wo (Nicholas), SZE commented on HDFS-3391: -- The patch looks good. Some comments:
- All calls of invalidateBlocks.contains(..) have matchGenStamp == true. How about removing matchGenStamp from the parameters?
- Remove the TODO below.
{code}
-if(invalidateBlocks.contains(dn.getStorageID(), block)) {
+if(invalidateBlocks.contains(dn.getStorageID(), block, true)) {
   /* TODO: following assertion is incorrect, see HDFS-2668
 assert storedBlock.findDatanode(dn) < 0 : "Block " + block
   + " in recentInvalidatesSet should not appear in DN " + dn; */
{code}
- In LightWeightHashSet.getEqualElement(..), since the key will be cast to T, change its type to T and remove @SuppressWarnings("unchecked"). Then, we need to cast the key to T in contains(..). Add @Override to contains(..). Also, how about renaming getEqualElement to getElement?
> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Arun C Murthy
> Assignee: Todd Lipcon
> Priority: Critical
> Attachments: hdfs-3391.txt, hdfs-3391.txt
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec <<< FAILURE!
> --
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
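The last review comment (a typed getElement plus a cast inside contains) can be illustrated with a small stand-in class. This is only a sketch of the typing pattern: `TinyElementSet` is hypothetical, is backed by a `HashMap` for brevity, and is not the real LightWeightHashSet. In this sketch the unchecked warning moves to contains(..), the one place where the cast happens.

```java
import java.util.AbstractSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for a set that can hand back its stored element.
class TinyElementSet<T> extends AbstractSet<T> {
  // Maps each element to itself so the stored instance can be returned.
  private final Map<T, T> elements = new HashMap<>();

  @Override
  public boolean add(T element) {
    return elements.putIfAbsent(element, element) == null;
  }

  /** Typed lookup: the parameter is T, so no cast is needed here. */
  public T getElement(T key) {
    return elements.get(key);
  }

  /** Object-typed membership test; the single cast to T lives here. */
  @Override
  @SuppressWarnings("unchecked")
  public boolean contains(Object key) {
    return getElement((T) key) != null;
  }

  @Override
  public Iterator<T> iterator() {
    return elements.keySet().iterator();
  }

  @Override
  public int size() {
    return elements.size();
  }
}
```

Because `getElement` returns the instance held by the set rather than the probe object, callers can recover stored state (for example a generation stamp) from a key that merely compares equal.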
[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278200#comment-13278200 ] Hadoop QA commented on HDFS-3049: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12527878/HDFS-3049.018.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2462//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2462//console This message is automatically generated. 
[jira] [Commented] (HDFS-3437) Remove name.node.address servlet attribute
[ https://issues.apache.org/jira/browse/HDFS-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278168#comment-13278168 ] Hadoop QA commented on HDFS-3437: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12527871/hdfs-3437.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2461//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2461//console This message is automatically generated. > Remove name.node.address servlet attribute > -- > > Key: HDFS-3437 > URL: https://issues.apache.org/jira/browse/HDFS-3437 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Affects Versions: 2.0.0 >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-3437.txt > > > Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY > since we always call DfsServlet#createNameNodeProxy within the NN. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HDFS-3438) BootstrapStandby should not require a rollEdits on active node
Todd Lipcon created HDFS-3438: - Summary: BootstrapStandby should not require a rollEdits on active node Key: HDFS-3438 URL: https://issues.apache.org/jira/browse/HDFS-3438 Project: Hadoop HDFS Issue Type: Improvement Components: ha Affects Versions: 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Currently, BootstrapStandby uses a rollEdits() call on the active NN in order to determine the most recent checkpoint and transaction ID. However, this means that you cannot bootstrap a standby when the active NN is in safe mode -- i.e., you have to start the whole cluster before you can bootstrap a new standby. This makes the workflow to convert an existing cluster to HA more complicated. We should allow BootstrapStandby to work even when the NN is in safe mode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2617) Replaced Kerberized SSL for image transfer and fsck with SPNEGO-based solution
[ https://issues.apache.org/jira/browse/HDFS-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278138#comment-13278138 ] Aaron T. Myers commented on HDFS-2617: -- bq. Should this be marked resolved or are we leaving it open for commit to 1.1? I'd like to put some version of this patch in 1.1, perhaps with a config option to continue to use KSSL if one wants so we don't necessarily break deployments that are currently successfully using KSSL. Perhaps we should resolve this one and open a new JIRA along the lines of "Back-port HDFS-2617 to branch-1" ? > Replaced Kerberized SSL for image transfer and fsck with SPNEGO-based solution > -- > > Key: HDFS-2617 > URL: https://issues.apache.org/jira/browse/HDFS-2617 > Project: Hadoop HDFS > Issue Type: Improvement > Components: security >Reporter: Jakob Homan >Assignee: Jakob Homan > Fix For: 2.0.0 > > Attachments: HDFS-2617-a.patch, HDFS-2617-b.patch, > HDFS-2617-config.patch, HDFS-2617-trunk.patch, HDFS-2617-trunk.patch, > HDFS-2617-trunk.patch, HDFS-2617-trunk.patch > > > The current approach to secure and authenticate nn web services is based on > Kerberized SSL and was developed when a SPNEGO solution wasn't available. Now > that we have one, we can get rid of the non-standard KSSL and use SPNEGO > throughout. This will simplify setup and configuration. Also, Kerberized > SSL is a non-standard approach with its own quirks and dark corners > (HDFS-2386). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
[ https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278124#comment-13278124 ] Hudson commented on HDFS-2800: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2280 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2280/]) HDFS-2800. Fix cancellation of checkpoints in the standby node to be more reliable. Contributed by Todd Lipcon. (Revision 1339745) Result = ABORTED todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339745 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SaveNamespaceContext.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/Canceler.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestSaveNamespace.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java > HA: TestStandbyCheckpoints.testCheckpointCancellation is racy > - > > Key: HDFS-2800 > URL: https://issues.apache.org/jira/browse/HDFS-2800 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, test >Affects Versions: 2.0.0 >Reporter: Aaron T. 
Myers >Assignee: Todd Lipcon > Fix For: 2.0.1 > > Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, > hdfs-2800.txt > > > TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the > following assert on line 212 fail: > {code} > assertTrue(StandbyCheckpointer.getCanceledCount() > 0); > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.
[ https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278105#comment-13278105 ] Tsz Wo (Nicholas), SZE commented on HDFS-3436: -- Good catch! I think the bug is in the following:
{code}
//DataNode.transferReplicaForPipelineRecovery(..)
synchronized(data) {
  if (data.isValidRbw(b)) {
    stage = BlockConstructionStage.TRANSFER_RBW;
  } else if (data.isValidBlock(b)) {
    stage = BlockConstructionStage.TRANSFER_FINALIZED;
  } else {
    final String r = data.getReplicaString(b.getBlockPoolId(), b.getBlockId());
    throw new IOException(b + " is neither a RBW nor a Finalized, r=" + r);
  }

  storedGS = data.getStoredBlock(b.getBlockPoolId(),
      b.getBlockId()).getGenerationStamp();
  if (storedGS < b.getGenerationStamp()) {
    throw new IOException(
        storedGS + " = storedGS < b.getGenerationStamp(), b=" + b);
  }
  visible = data.getReplicaVisibleLength(b);
}
{code}
It should first call getStoredBlock(..) and then use the stored block to call isValidRbw(..) and isValidBlock(..). It expects GS to be updated but does not handle it correctly.
> Append to file is failing when one of the datanode where the block present is down.
> ---
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 2.0.0
> Reporter: Brahma Reddy Battula
> Assignee: Vinay
>
> Scenario:
> =
> 1. Cluster with 4 DataNodes.
> 2. Wrote a file to 3 DNs, DN1->DN2->DN3.
> 3. Stopped DN3.
> Now appending to the file fails because addDatanode2ExistingPipeline fails.
> *Client Trace*
> {noformat}
> 2012-04-24 22:06:09,947 INFO hdfs.DFSClient (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN hdfs.DFSClient (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:run(549)) - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:hflush(1515)) - Error while syncing
> java.io.EOFException: Premature EOF: no length prefix available
>   at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> java.io.EOFException: Premature EOF: no length prefix available
>   at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> {noformat}
> *DataNode Trace*
> {noformat}
> 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver
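The ordering suggested in the HDFS-3436 comment above (look up the stored block first, then run the state checks against it) can be sketched with a simplified model. Everything here is a hypothetical stand-in rather than DataNode code: `Dataset`, `Replica` and both `stage...` methods are invented for illustration, and the sketch assumes that the validity checks only match when the generation stamp (GS) is identical.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Simplified model, not DataNode code: all types are hypothetical stand-ins.
class PipelineRecoverySketch {
  enum ReplicaState { RBW, FINALIZED }
  enum Stage { TRANSFER_RBW, TRANSFER_FINALIZED }

  /** A replica as stored on the datanode. */
  static final class Replica {
    final long blockId;
    final long genStamp;
    final ReplicaState state;

    Replica(long blockId, long genStamp, ReplicaState state) {
      this.blockId = blockId;
      this.genStamp = genStamp;
      this.state = state;
    }
  }

  /** In-memory stand-in for the datanode's dataset. */
  static final class Dataset {
    private final Map<Long, Replica> replicas = new HashMap<>();

    void put(Replica r) { replicas.put(r.blockId, r); }

    Replica getStored(long blockId) { return replicas.get(blockId); }

    // Assumption: a block is only "valid" if id, GS and state all match.
    boolean isValid(long blockId, long genStamp, ReplicaState state) {
      Replica r = replicas.get(blockId);
      return r != null && r.genStamp == genStamp && r.state == state;
    }
  }

  /** Order of the quoted snippet: state checks use the caller's GS. */
  static Stage stageFromRequestedGS(Dataset data, long blockId, long requestedGS)
      throws IOException {
    if (data.isValid(blockId, requestedGS, ReplicaState.RBW)) {
      return Stage.TRANSFER_RBW;
    }
    if (data.isValid(blockId, requestedGS, ReplicaState.FINALIZED)) {
      return Stage.TRANSFER_FINALIZED;
    }
    throw new IOException("block " + blockId + " is neither RBW nor finalized");
  }

  /** Suggested order: fetch the stored block, then check state with its GS. */
  static Stage stageFromStoredGS(Dataset data, long blockId, long requestedGS)
      throws IOException {
    Replica stored = data.getStored(blockId);
    if (stored == null) {
      throw new IOException("block " + blockId + " not found");
    }
    if (stored.genStamp < requestedGS) {
      throw new IOException(stored.genStamp + " = storedGS < " + requestedGS);
    }
    if (data.isValid(blockId, stored.genStamp, ReplicaState.RBW)) {
      return Stage.TRANSFER_RBW;
    }
    if (data.isValid(blockId, stored.genStamp, ReplicaState.FINALIZED)) {
      return Stage.TRANSFER_FINALIZED;
    }
    throw new IOException("block " + blockId + " is neither RBW nor finalized");
  }
}
```

Under these assumptions, a replica whose stored GS is already newer than the one the caller passes in is rejected by the first method but accepted by the second, which is the kind of mismatch the comment appears to describe.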
[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278079#comment-13278079 ] Colin Patrick McCabe commented on HDFS-3049: general description of the changes between 15 and 17/18 (sorry for not posting this earlier, it was late): * update some unit tests. In particular TestFileJournalManager#getNumberOfTransactions now takes a parameter that specifies whether it stops counting transactions at a gap, or not. Some functions in the test want one behavior or the other. * fix an off-by-one error in checkForGaps. * remove some deadcode that was causing a findbugs warning * fix a case where String.format was getting the wrong number of args * fix validation of files with corrupt headers (basically, force a header read). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
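The patch's checkForGaps is not shown in this thread, so the following is only a sketch of the kind of boundary arithmetic where an off-by-one can creep in when checking edit log segments for gaps. It assumes segments with inclusive first and last transaction IDs, sorted by first ID; `Segment` and `firstMissingTxId` are hypothetical names.

```java
import java.util.List;

// Illustrative only: not the checkForGaps implementation from the patch.
class GapCheckSketch {

  /** An edit log segment covering [firstTxId, lastTxId], both inclusive. */
  static final class Segment {
    final long firstTxId;
    final long lastTxId;

    Segment(long firstTxId, long lastTxId) {
      this.firstTxId = firstTxId;
      this.lastTxId = lastTxId;
    }
  }

  /**
   * Returns the first transaction ID in [fromTxId, toTxId] that no segment
   * covers, or -1 if the sorted segments cover the whole range.
   */
  static long firstMissingTxId(List<Segment> sorted, long fromTxId, long toTxId) {
    long next = fromTxId; // next txid we still need
    for (Segment s : sorted) {
      if (s.lastTxId < next) {
        continue; // segment lies entirely before what we need
      }
      if (s.firstTxId > next) {
        return next; // gap between the previous segment and this one
      }
      // The end is inclusive, so the next needed txid is lastTxId + 1.
      // Using lastTxId here would be a typical off-by-one.
      next = s.lastTxId + 1;
      if (next > toTxId) {
        return -1;
      }
    }
    return next > toTxId ? -1 : next;
  }
}
```

With segments [1, 10] and [12, 20], for example, the method reports 11 as the first missing transaction ID.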
[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt
[ https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3049: --- Attachment: HDFS-3049.018.patch * fix mockito stuff in TestFailureToReadEdits -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp
[ https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278070#comment-13278070 ] Hudson commented on HDFS-3434: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2279 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2279/]) HDFS-3434. InvalidProtocolBufferException when visiting DN browseDirectory.jsp. Contributed by Eli Collins (Revision 1339712) Result = ABORTED eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339712 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeHttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/common/TestJspHelper.java > InvalidProtocolBufferException when visiting DN browseDirectory.jsp > --- > > Key: HDFS-3434 > URL: https://issues.apache.org/jira/browse/HDFS-3434 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt > > > The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When > selecting the "haus04" under the "Node" table I get a link with the http > address which is bogus (the wildcard/http port not the nn rpc addr), which > results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to > 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: > Connection refused". The browse this file system link works. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2936) File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication
[ https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278067#comment-13278067 ] Harsh J commented on HDFS-2936: --- Nicholas, bq. Thanks for updating the description. Are you suggesting to use dfs.namenode.replication.min for client-side check and use dfs.namenode.replication.min.for.write for server-side check? Sort of. The former (or whatever replaces the former) should only check file replication factor short-values, which are applied/changed during create/setReplicationFactor alone. Not live block count. This is still a server-side check. Client-side checks would be of no use to an admin. bq. BTW, "File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication" is the original design of dfs.namenode.replication.min. I think we should not change it. True that that was the intention. A non-behavior-changing patch can also be made (wherein the default of the for.write property will be what the original min property is). But let's at least provide a way for admins to enforce minimum replication _factors_ on files, without having to worry about pipelines and what not - if an admin so wishes to. Setting {{dfs.replication}} to final does not work, because there are create() API calls and setrep() calls that bypass/disregard that config. Essentially that's what led us down this path - to use minimum, but just at meta-level, not live-block level (as it is today). > File close()-ing hangs indefinitely if the number of live blocks does not > match the minimum replication > --- > > Key: HDFS-2936 > URL: https://issues.apache.org/jira/browse/HDFS-2936 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Affects Versions: 0.23.0 >Reporter: Harsh J >Assignee: Harsh J > Attachments: HDFS-2936.patch > > > If an admin wishes to enforce replication today for all the users of their > cluster, he may set {{dfs.namenode.replication.min}}. 
This property prevents > users from creating files with < expected replication factor. > However, the value of minimum replication set by the above value is also > checked at several other points, especially during completeFile (close) > operations. If a condition arises wherein a write's pipeline may have gotten > only < minimum nodes in it, the completeFile operation does not successfully > close the file and the client begins to hang waiting for NN to replicate the > last bad block in the background. This form of hard-guarantee can, for > example, bring down clusters of HBase during high xceiver load on DN, or disk > fill-ups on many of them, etc.. > I propose we should split the property in two parts: > * dfs.namenode.replication.min > ** Stays the same name, but only checks file creation time replication factor > value and during adjustments made via setrep/etc. > * dfs.namenode.replication.min.for.write > ** New property that disconnects the rest of the checks from the above > property, such as the checks done during block commit, file complete/close, > safemode checks for block availability, etc.. > Alternatively, we may also choose to remove the client-side hang of > completeFile/close calls with a set number of retries. This would further > require discussion about how a file-closure handle ought to be handled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
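The split proposed in HDFS-2936 can be pictured as two independent checks. The following is only a minimal sketch under stated assumptions: the class, its field names, and its methods are hypothetical and are not HDFS code, and {{dfs.namenode.replication.min.for.write}} is a property proposed in this issue, not one known to exist. It shows the requested replication factor being validated at create/setrep time, separately from the live-replica count that gates file completion.

```java
/**
 * Hypothetical illustration of the HDFS-2936 proposal; not NameNode code.
 * Two thresholds are kept apart so that one can be relaxed without the other.
 */
final class ReplicationMinSketch {
    // Stand-in for dfs.namenode.replication.min: bounds the *requested* factor.
    private final short minFactor;
    // Stand-in for the proposed dfs.namenode.replication.min.for.write:
    // bounds the number of *live* replicas needed to complete a file.
    private final int minLiveForWrite;

    ReplicationMinSketch(short minFactor, int minLiveForWrite) {
        this.minFactor = minFactor;
        this.minLiveForWrite = minLiveForWrite;
    }

    /** Metadata-level check, applied at create() and setrep time only. */
    void verifyRequestedFactor(short requested) {
        if (requested < minFactor) {
            throw new IllegalArgumentException("Requested replication "
                + requested + " is below the minimum of " + minFactor);
        }
    }

    /** Live-block check, applied at block commit and completeFile (close). */
    boolean canComplete(int liveReplicas) {
        return liveReplicas >= minLiveForWrite;
    }
}
```

With a factor minimum of 3 and a write minimum of 1, a file created with replication 2 is rejected up front, while a file whose pipeline shrank to a single live replica can still be closed.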
[jira] [Updated] (HDFS-3431) Improve fuse-dfs truncate
[ https://issues.apache.org/jira/browse/HDFS-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-3431: --- Description: Fuse-dfs truncate only works for size == 0 and per the function's comment is a "Weak implementation in that we just delete the file and then re-create it, but don't set the user, group, and times to the old file's metadata". Per HDFS-860 we should ENOTSUP when the size != 0 or the size of the file. Also, we should implement the ftruncate system call in FUSE. was:Fuse-dfs truncate only works for size == 0 and per the function's comment is a "Weak implementation in that we just delete the file and then re-create it, but don't set the user, group, and times to the old file's metadata". Per HDFS-860 we should ENOTSUP when the size != 0 or the size of the file. > Improve fuse-dfs truncate > - > > Key: HDFS-3431 > URL: https://issues.apache.org/jira/browse/HDFS-3431 > Project: Hadoop HDFS > Issue Type: Bug > Components: contrib/fuse-dfs >Reporter: Eli Collins >Priority: Minor > > Fuse-dfs truncate only works for size == 0 and per the function's comment is > a "Weak implementation in that we just delete the file and then re-create it, > but don't set the user, group, and times to the old file's metadata". Per > HDFS-860 we should ENOTSUP when the size != 0 or the size of the file. > Also, we should implement the ftruncate system call in FUSE. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
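One reading of the HDFS-3431 description is that truncate should succeed only when the requested size is 0 or equals the file's current size, and should return ENOTSUP otherwise. The sketch below encodes that reading as a small decision function. It is written in Java purely for illustration (fuse-dfs itself is C), and the class, enum, and method names are hypothetical.

```java
/** Hypothetical decision table for the truncate behavior discussed in HDFS-3431. */
final class TruncateSketch {
    enum Action {
        RECREATE_EMPTY, // size == 0: delete and re-create the file
        NO_OP,          // size == current size: nothing to change
        NOT_SUPPORTED   // any other size: the caller would map this to ENOTSUP
    }

    static Action decide(long requestedSize, long currentSize) {
        if (requestedSize == currentSize) {
            return Action.NO_OP;
        }
        if (requestedSize == 0) {
            return Action.RECREATE_EMPTY;
        }
        return Action.NOT_SUPPORTED;
    }
}
```

Checking for equality first means truncating an already empty file to 0 is a no-op, which avoids the delete-and-re-create path and the loss of user, group, and times that the function's comment warns about.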
[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
[ https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278035#comment-13278035 ] Tsz Wo (Nicholas), SZE commented on HDFS-3391: -- Hi Todd, thanks for posting a patch. I will review it later today. > TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing > --- > > Key: HDFS-3391 > URL: https://issues.apache.org/jira/browse/HDFS-3391 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Arun C Murthy >Assignee: Todd Lipcon >Priority: Critical > Attachments: hdfs-3391.txt, hdfs-3391.txt > > > Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< > FAILURE! > -- > Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover > Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec > <<< FAILURE! > -- -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
[ https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278031#comment-13278031 ] Hudson commented on HDFS-2800: -- Integrated in Hadoop-Common-trunk-Commit #2263 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2263/]) HDFS-2800. Fix cancellation of checkpoints in the standby node to be more reliable. Contributed by Todd Lipcon. (Revision 1339745) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339745 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SaveNamespaceContext.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/Canceler.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestSaveNamespace.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java > HA: TestStandbyCheckpoints.testCheckpointCancellation is racy > - > > Key: HDFS-2800 > URL: https://issues.apache.org/jira/browse/HDFS-2800 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, test >Affects Versions: 2.0.0 >Reporter: Aaron T. 
Myers >Assignee: Todd Lipcon > Fix For: 2.0.1 > > Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, > hdfs-2800.txt > > > TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the > following assert on line 212 fail: > {code} > assertTrue(StandbyCheckpointer.getCanceledCount() > 0); > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
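An assertion such as the one quoted in HDFS-2800 is racy when the counter is updated by another thread. A common general mitigation, shown here only as a sketch, is to poll for the condition with a timeout before asserting. This is not the committed fix, which per the commit message made checkpoint cancellation itself more reliable; the helper class below is hypothetical.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: poll a condition until it holds or a timeout expires. */
final class WaitForSketch {
    static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition becoming true
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt and report the condition's current state.
                Thread.currentThread().interrupt();
                return condition.getAsBoolean();
            }
        }
    }
}
```

A test could then assert on `WaitForSketch.waitFor(() -> StandbyCheckpointer.getCanceledCount() > 0, 100, 10000)` rather than reading the counter once. The interval and timeout values here are arbitrary.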
[jira] [Commented] (HDFS-2936) File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication
[ https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278032#comment-13278032 ] Tsz Wo (Nicholas), SZE commented on HDFS-2936: -- Hi Harsh, Thanks for updating the description. Are you suggesting to use dfs.namenode.replication.min for client-side check and use dfs.namenode.replication.min.for.write for server-side check? BTW, "File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication" is the original design of dfs.namenode.replication.min. I think we should not change it. > File close()-ing hangs indefinitely if the number of live blocks does not > match the minimum replication > --- > > Key: HDFS-2936 > URL: https://issues.apache.org/jira/browse/HDFS-2936 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Affects Versions: 0.23.0 >Reporter: Harsh J >Assignee: Harsh J > Attachments: HDFS-2936.patch > > > If an admin wishes to enforce replication today for all the users of their > cluster, he may set {{dfs.namenode.replication.min}}. This property prevents > users from creating files with < expected replication factor. > However, the value of minimum replication set by the above value is also > checked at several other points, especially during completeFile (close) > operations. If a condition arises wherein a write's pipeline may have gotten > only < minimum nodes in it, the completeFile operation does not successfully > close the file and the client begins to hang waiting for NN to replicate the > last bad block in the background. This form of hard-guarantee can, for > example, bring down clusters of HBase during high xceiver load on DN, or disk > fill-ups on many of them, etc.. > I propose we should split the property in two parts: > * dfs.namenode.replication.min > ** Stays the same name, but only checks file creation time replication factor > value and during adjustments made via setrep/etc. 
> * dfs.namenode.replication.min.for.write > ** New property that disconnects the rest of the checks from the above > property, such as the checks done during block commit, file complete/close, > safemode checks for block availability, etc.. > Alternatively, we may also choose to remove the client-side hang of > completeFile/close calls with a set number of retries. This would further > require discussion about how a file-closure handle ought to be handled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-3437) Remove name.node.address servlet attribute
[ https://issues.apache.org/jira/browse/HDFS-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-3437: -- Status: Patch Available (was: Open) > Remove name.node.address servlet attribute > -- > > Key: HDFS-3437 > URL: https://issues.apache.org/jira/browse/HDFS-3437 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Affects Versions: 2.0.0 >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-3437.txt > > > Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY > since we always call DfsServlet#createNameNodeProxy within the NN.
[jira] [Updated] (HDFS-3437) Remove name.node.address servlet attribute
[ https://issues.apache.org/jira/browse/HDFS-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-3437: -- Attachment: hdfs-3437.txt Patch attached. I'm not sure about the last case in testGetUgi, specifically whether NAMENODE_ADDRESS should take precedence over DELEGATION_PARAMETER, since printGotoForm (for example) sets both, and the code previously might have ignored NAMENODE_ADDRESS in favor of NAMENODE_ADDRESS_ATTRIBUTE_KEY. I don't think this changes behavior, and the new test passes with the current code, but it's worth double-checking. > Remove name.node.address servlet attribute > -- > > Key: HDFS-3437 > URL: https://issues.apache.org/jira/browse/HDFS-3437 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node >Affects Versions: 2.0.0 >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-3437.txt > > > Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY > since we always call DfsServlet#createNameNodeProxy within the NN.
[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
[ https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278021#comment-13278021 ] Hudson commented on HDFS-2800: -- Integrated in Hadoop-Hdfs-trunk-Commit #2336 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2336/]) HDFS-2800. Fix cancellation of checkpoints in the standby node to be more reliable. Contributed by Todd Lipcon. (Revision 1339745) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339745 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SaveNamespaceContext.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/Canceler.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestSaveNamespace.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java > HA: TestStandbyCheckpoints.testCheckpointCancellation is racy > - > > Key: HDFS-2800 > URL: https://issues.apache.org/jira/browse/HDFS-2800 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, test >Affects Versions: 2.0.0 >Reporter: Aaron T. 
Myers >Assignee: Todd Lipcon > Fix For: 2.0.1 > > Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, > hdfs-2800.txt > > > TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the > following assert on line 212 fail: > {code} > assertTrue(StandbyCheckpointer.getCanceledCount() > 0); > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
[ https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated HDFS-2800: -- Resolution: Fixed Fix Version/s: 2.0.1 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) > HA: TestStandbyCheckpoints.testCheckpointCancellation is racy > - > > Key: HDFS-2800 > URL: https://issues.apache.org/jira/browse/HDFS-2800 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, test >Affects Versions: 2.0.0 >Reporter: Aaron T. Myers >Assignee: Todd Lipcon > Fix For: 2.0.1 > > Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, > hdfs-2800.txt > > > TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the > following assert on line 212 fail: > {code} > assertTrue(StandbyCheckpointer.getCanceledCount() > 0); > {code}
[jira] [Commented] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters
[ https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278013#comment-13278013 ] Hudson commented on HDFS-1153: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2278 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2278/]) HDFS-1153. dfsnodelist.jsp should handle invalid input parameters. Contributed by Ravi Phulari (Revision 1339706) Result = ABORTED eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339706 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java > dfsnodelist.jsp should handle invalid input parameters > -- > > Key: HDFS-1153 > URL: https://issues.apache.org/jira/browse/HDFS-1153 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 1.0.0 >Reporter: Ravi Phulari >Assignee: Ravi Phulari >Priority: Minor > Fix For: 2.0.1 > > Attachments: HDFS-1153.patch, hdfs-1153.txt > > > Navigation to dfsnodelist.jsp with invalid input parameters produces NPE and > HTTP 500 error. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp
[ https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277995#comment-13277995 ] Eli Collins commented on HDFS-3434: --- Thanks Todd. I've committed this to trunk and merged to branch-2. Filed HDFS-3437 for removing name.node.address. > InvalidProtocolBufferException when visiting DN browseDirectory.jsp > --- > > Key: HDFS-3434 > URL: https://issues.apache.org/jira/browse/HDFS-3434 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt > > > The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When > selecting the "haus04" under the "Node" table I get a link with the http > address which is bogus (the wildcard/http port not the nn rpc addr), which > results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to > 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: > Connection refused". The browse this file system link works. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp
[ https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277988#comment-13277988 ] Hudson commented on HDFS-3434: -- Integrated in Hadoop-Common-trunk-Commit #2262 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2262/]) HDFS-3434. InvalidProtocolBufferException when visiting DN browseDirectory.jsp. Contributed by Eli Collins (Revision 1339712) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339712 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeHttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/common/TestJspHelper.java > InvalidProtocolBufferException when visiting DN browseDirectory.jsp > --- > > Key: HDFS-3434 > URL: https://issues.apache.org/jira/browse/HDFS-3434 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt > > > The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When > selecting the "haus04" under the "Node" table I get a link with the http > address which is bogus (the wildcard/http port not the nn rpc addr), which > results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to > 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: > Connection refused". The browse this file system link works. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters
[ https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277980#comment-13277980 ] Hudson commented on HDFS-1153: -- Integrated in Hadoop-Common-trunk-Commit #2261 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2261/]) HDFS-1153. dfsnodelist.jsp should handle invalid input parameters. Contributed by Ravi Phulari (Revision 1339706) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339706 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java > dfsnodelist.jsp should handle invalid input parameters > -- > > Key: HDFS-1153 > URL: https://issues.apache.org/jira/browse/HDFS-1153 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 1.0.0 >Reporter: Ravi Phulari >Assignee: Ravi Phulari >Priority: Minor > Fix For: 2.0.1 > > Attachments: HDFS-1153.patch, hdfs-1153.txt > > > Navigation to dfsnodelist.jsp with invalid input parameters produces NPE and > HTTP 500 error. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters
[ https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277977#comment-13277977 ] Hudson commented on HDFS-1153: -- Integrated in Hadoop-Hdfs-trunk-Commit #2335 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2335/]) HDFS-1153. dfsnodelist.jsp should handle invalid input parameters. Contributed by Ravi Phulari (Revision 1339706) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339706 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java > dfsnodelist.jsp should handle invalid input parameters > -- > > Key: HDFS-1153 > URL: https://issues.apache.org/jira/browse/HDFS-1153 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 1.0.0 >Reporter: Ravi Phulari >Assignee: Ravi Phulari >Priority: Minor > Fix For: 2.0.1 > > Attachments: HDFS-1153.patch, hdfs-1153.txt > > > Navigation to dfsnodelist.jsp with invalid input parameters produces NPE and > HTTP 500 error. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp
[ https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277978#comment-13277978 ] Hudson commented on HDFS-3434: -- Integrated in Hadoop-Hdfs-trunk-Commit #2335 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2335/]) HDFS-3434. InvalidProtocolBufferException when visiting DN browseDirectory.jsp. Contributed by Eli Collins (Revision 1339712) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339712 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeHttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/common/TestJspHelper.java > InvalidProtocolBufferException when visiting DN browseDirectory.jsp > --- > > Key: HDFS-3434 > URL: https://issues.apache.org/jira/browse/HDFS-3434 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eli Collins >Assignee: Eli Collins > Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt > > > The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When > selecting the "haus04" under the "Node" table I get a link with the http > address which is bogus (the wildcard/http port not the nn rpc addr), which > results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to > 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: > Connection refused". The browse this file system link works. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters
[ https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-1153: -- Resolution: Fixed Status: Resolved (was: Patch Available) > dfsnodelist.jsp should handle invalid input parameters > -- > > Key: HDFS-1153 > URL: https://issues.apache.org/jira/browse/HDFS-1153 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 1.0.0 >Reporter: Ravi Phulari >Assignee: Ravi Phulari >Priority: Minor > Fix For: 2.0.1 > > Attachments: HDFS-1153.patch, hdfs-1153.txt > > > Navigation to dfsnodelist.jsp with invalid input parameters produces NPE and > HTTP 500 error.
[jira] [Created] (HDFS-3437) Remove name.node.address servlet attribute
Eli Collins created HDFS-3437: - Summary: Remove name.node.address servlet attribute Key: HDFS-3437 URL: https://issues.apache.org/jira/browse/HDFS-3437 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Affects Versions: 2.0.0 Reporter: Eli Collins Assignee: Eli Collins Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY since we always call DfsServlet#createNameNodeProxy within the NN.
[jira] [Updated] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters
[ https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-1153: -- Component/s: data-node Priority: Minor (was: Major) Target Version/s: (was: 2.0.1) Affects Version/s: (was: 0.20.2) (was: 0.20.1) 1.0.0 Fix Version/s: 2.0.1 I've committed this and merged to branch-2. Thanks Ravi! > dfsnodelist.jsp should handle invalid input parameters > -- > > Key: HDFS-1153 > URL: https://issues.apache.org/jira/browse/HDFS-1153 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 1.0.0 >Reporter: Ravi Phulari >Assignee: Ravi Phulari >Priority: Minor > Fix For: 2.0.1 > > Attachments: HDFS-1153.patch, hdfs-1153.txt > > > Navigation to dfsnodelist.jsp with invalid input parameters produces NPE and > HTTP 500 error. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters
[ https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated HDFS-1153: -- Summary: dfsnodelist.jsp should handle invalid input parameters (was: The navigation to /dfsnodelist.jsp with invalid input parameters produces NPE and HTTP 500 error) > dfsnodelist.jsp should handle invalid input parameters > -- > > Key: HDFS-1153 > URL: https://issues.apache.org/jira/browse/HDFS-1153 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 0.20.1, 0.20.2 >Reporter: Ravi Phulari >Assignee: Ravi Phulari > Attachments: HDFS-1153.patch, hdfs-1153.txt > > > Navigation to dfsnodelist.jsp with invalid input parameters produces NPE and > HTTP 500 error. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
[ https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277960#comment-13277960 ] Eli Collins commented on HDFS-2800: --- +1 updated patch looks good > HA: TestStandbyCheckpoints.testCheckpointCancellation is racy > - > > Key: HDFS-2800 > URL: https://issues.apache.org/jira/browse/HDFS-2800 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, test >Affects Versions: 2.0.0 >Reporter: Aaron T. Myers >Assignee: Todd Lipcon > Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, > hdfs-2800.txt > > > TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the > following assert on line 212 fail: > {code} > assertTrue(StandbyCheckpointer.getCanceledCount() > 0); > {code}
[jira] [Commented] (HDFS-2717) BookKeeper Journal output stream doesn't check addComplete rc
[ https://issues.apache.org/jira/browse/HDFS-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277877#comment-13277877 ] Uma Maheswara Rao G commented on HDFS-2717: --- Patch looks great, Ivan. @Rakesh or Jitendra, do you have any more comments? I am +1 on this patch. > BookKeeper Journal output stream doesn't check addComplete rc > - > > Key: HDFS-2717 > URL: https://issues.apache.org/jira/browse/HDFS-2717 > Project: Hadoop HDFS > Issue Type: Sub-task >Affects Versions: 3.0.0 >Reporter: Ivan Kelly >Assignee: Ivan Kelly > Attachments: HDFS-2717.2.diff, HDFS-2717.diff > > > As the summary says, we're not checking the addComplete return code, so there's a > chance of losing updates.
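The problem described in HDFS-2717 is that an asynchronous add can fail without the writer ever noticing. A minimal sketch of the general remedy follows: record the first non-OK return code in the completion callback and surface it before a flush is reported as durable. Everything here is a hypothetical stand-in, not the BookKeeper journal code or the attached patch. The callback signature is simplified, and treating 0 as the success code is an assumption of the sketch.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an output stream backed by asynchronous adds. */
final class RcCheckingStreamSketch {
    static final int OK = 0; // assumed success code for this sketch

    // First non-OK return code reported by any callback; OK if none so far.
    private final AtomicInteger firstError = new AtomicInteger(OK);

    /** Simplified completion callback: remember the first failure. */
    void addComplete(int rc, long entryId) {
        if (rc != OK) {
            firstError.compareAndSet(OK, rc);
        }
    }

    /** Call before acknowledging a flush; fails if any add failed. */
    void checkForErrors() throws IOException {
        int rc = firstError.get();
        if (rc != OK) {
            throw new IOException("Asynchronous add failed with return code " + rc);
        }
    }
}
```

Keeping only the first error is a deliberate simplification: once any add has failed, the stream can no longer claim that every edit was persisted, so later codes add little.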
[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled
[ https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277830#comment-13277830 ] Hudson commented on HDFS-3433: -- Integrated in Hadoop-Mapreduce-trunk #1082 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/]) HDFS-3433. GetImageServlet should allow administrative requestors when security is enabled. Contributed by Aaron T. Myers. (Revision 1339540) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java > GetImageServlet should allow administrative requestors when security is > enabled > --- > > Key: HDFS-3433 > URL: https://issues.apache.org/jira/browse/HDFS-3433 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Aaron T. Myers >Assignee: Aaron T. Myers > Fix For: 2.0.1 > > Attachments: HDFS-3433.patch > > > Currently the GetImageServlet only allows the NN and checkpointing nodes to > connect. Since we now have the fetchImage command in DFSAdmin, we should also > allow administrative requests as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-860) fuse-dfs truncate behavior causes issues with scp
[ https://issues.apache.org/jira/browse/HDFS-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277829#comment-13277829 ] Hudson commented on HDFS-860: - Integrated in Hadoop-Mapreduce-trunk #1082 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/]) HDFS-860. fuse-dfs truncate behavior causes issues with scp. Contributed by Brian Bockelman (Revision 1339413) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339413 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src/fuse_impls_truncate.c > fuse-dfs truncate behavior causes issues with scp > - > > Key: HDFS-860 > URL: https://issues.apache.org/jira/browse/HDFS-860 > Project: Hadoop HDFS > Issue Type: Wish > Components: contrib/fuse-dfs >Reporter: Brian Bockelman >Assignee: Brian Bockelman >Priority: Minor > Fix For: 2.0.0 > > Attachments: HDFS-860.patch, hdfs-860.txt > > > For whatever reason, scp issues a "truncate" once it's written a file to > truncate the file to the # of bytes it has written (i.e., if a file is X > bytes, it calls truncate(X)). > This fails on the current fuse-dfs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3413) TestFailureToReadEdits timing out
[ https://issues.apache.org/jira/browse/HDFS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277824#comment-13277824 ] Hudson commented on HDFS-3413: -- Integrated in Hadoop-Mapreduce-trunk #1082 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/]) HDFS-3413. TestFailureToReadEdits timing out. Contributed by Aaron T. Myers. (Revision 1339250) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339250 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestFailureToReadEdits.java > TestFailureToReadEdits timing out > - > > Key: HDFS-3413 > URL: https://issues.apache.org/jira/browse/HDFS-3413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, test >Affects Versions: 2.0.0, 3.0.0 >Reporter: Todd Lipcon >Assignee: Aaron T. Myers >Priority: Critical > Fix For: 2.0.1 > > Attachments: HDFS-3413.patch > > > HDFS-3026 made it so that an HA NN that does not fully complete an HA state > transition will exit immediately. TestFailureToReadEdits has a test case > which causes an incomplete state transition, thus causing a JVM exit in the > test. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3422) TestStandbyIsHot timeouts too aggressive
[ https://issues.apache.org/jira/browse/HDFS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277825#comment-13277825 ] Hudson commented on HDFS-3422: -- Integrated in Hadoop-Mapreduce-trunk #1082 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/]) HDFS-3422. TestStandbyIsHot timeouts too aggressive. Contributed by Todd Lipcon. (Revision 1339452) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339452 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyIsHot.java > TestStandbyIsHot timeouts too aggressive > > > Key: HDFS-3422 > URL: https://issues.apache.org/jira/browse/HDFS-3422 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Minor > Fix For: 2.0.1 > > Attachments: hdfs-3422.txt > > > I've seen TestStandbyIsHot timeout a few times waiting for replication to > change, but when I look at the logs, it appears everything is fine. The block > deletions are just a bit slow in being processed. > To improve the test we should either increase the timeouts, or explicitly > trigger heartbeats and replication work after changing the desired > replication levels. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3422) TestStandbyIsHot timeouts too aggressive
[ https://issues.apache.org/jira/browse/HDFS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277738#comment-13277738 ] Hudson commented on HDFS-3422: -- Integrated in Hadoop-Hdfs-trunk #1048 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1048/]) HDFS-3422. TestStandbyIsHot timeouts too aggressive. Contributed by Todd Lipcon. (Revision 1339452) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339452 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyIsHot.java > TestStandbyIsHot timeouts too aggressive > > > Key: HDFS-3422 > URL: https://issues.apache.org/jira/browse/HDFS-3422 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.0.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Minor > Fix For: 2.0.1 > > Attachments: hdfs-3422.txt > > > I've seen TestStandbyIsHot timeout a few times waiting for replication to > change, but when I look at the logs, it appears everything is fine. The block > deletions are just a bit slow in being processed. > To improve the test we should either increase the timeouts, or explicitly > trigger heartbeats and replication work after changing the desired > replication levels. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled
[ https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277741#comment-13277741 ] Hudson commented on HDFS-3433: -- Integrated in Hadoop-Hdfs-trunk #1048 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1048/]) HDFS-3433. GetImageServlet should allow administrative requestors when security is enabled. Contributed by Aaron T. Myers. (Revision 1339540) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java > GetImageServlet should allow administrative requestors when security is > enabled > --- > > Key: HDFS-3433 > URL: https://issues.apache.org/jira/browse/HDFS-3433 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Aaron T. Myers >Assignee: Aaron T. Myers > Fix For: 2.0.1 > > Attachments: HDFS-3433.patch > > > Currently the GetImageServlet only allows the NN and checkpointing nodes to > connect. Since we now have the fetchImage command in DFSAdmin, we should also > allow administrative requests as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-860) fuse-dfs truncate behavior causes issues with scp
[ https://issues.apache.org/jira/browse/HDFS-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277740#comment-13277740 ] Hudson commented on HDFS-860: - Integrated in Hadoop-Hdfs-trunk #1048 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1048/]) HDFS-860. fuse-dfs truncate behavior causes issues with scp. Contributed by Brian Bockelman (Revision 1339413) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339413 Files : * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src/fuse_impls_truncate.c > fuse-dfs truncate behavior causes issues with scp > - > > Key: HDFS-860 > URL: https://issues.apache.org/jira/browse/HDFS-860 > Project: Hadoop HDFS > Issue Type: Wish > Components: contrib/fuse-dfs >Reporter: Brian Bockelman >Assignee: Brian Bockelman >Priority: Minor > Fix For: 2.0.0 > > Attachments: HDFS-860.patch, hdfs-860.txt > > > For whatever reason, scp issues a "truncate" once it's written a file to > truncate the file to the # of bytes it has written (i.e., if a file is X > bytes, it calls truncate(X)). > This fails on the current fuse-dfs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled
[ https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277730#comment-13277730 ] Hudson commented on HDFS-3433: -- Integrated in Hadoop-Mapreduce-trunk-Commit #2275 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2275/]) HDFS-3433. GetImageServlet should allow administrative requestors when security is enabled. Contributed by Aaron T. Myers. (Revision 1339540) Result = ABORTED atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java > GetImageServlet should allow administrative requestors when security is > enabled > --- > > Key: HDFS-3433 > URL: https://issues.apache.org/jira/browse/HDFS-3433 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Aaron T. Myers >Assignee: Aaron T. Myers > Fix For: 2.0.1 > > Attachments: HDFS-3433.patch > > > Currently the GetImageServlet only allows the NN and checkpointing nodes to > connect. Since we now have the fetchImage command in DFSAdmin, we should also > allow administrative requests as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes
    [ https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277723#comment-13277723 ]

Rakesh R commented on HDFS-3423:
--------------------------------

In the patch I see that maxTxId.reset(maxTxId.get()-1); is invoked only on SegmentEmptyException.

But I'm thinking about the inprogress_x ledgers that are not empty and were previously finalized but not deleted. The following code is taken from BKJM. When l.verify(zkc, finalisedPath) == true, instead of storing the maxTxId and deleting the znode, we would only delete the inprogress_x node, since the corresponding edit_x_y log already exists. IMHO, this is safer. What's your opinion?

{noformat}
try {
  l.write(zkc, finalisedPath);
} catch (KeeperException.NodeExistsException nee) {
  if (!l.verify(zkc, finalisedPath)) {
    throw new IOException("Node " + finalisedPath + " already exists"
                          + " but data doesn't match");
  }
}
maxTxId.store(lastTxId);
zkc.delete(inprogressPath, inprogressStat.getVersion());
{noformat}

> BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3423
>                 URL: https://issues.apache.org/jira/browse/HDFS-3423
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Rakesh R
>            Assignee: Ivan Kelly
>         Attachments: HDFS-3423.diff
>
>
> Say the inProgress_000X node is corrupted because the data (version, ledgerId, firstTxId) was not written to this inProgress_000X znode. Namenode startup has logic to recover all the unfinalized segments; here it tries to read the segment and shuts down.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>       throws IOException, KeeperException.NoNodeException {
>     byte[] data = zkc.getData(path, false, null);
>     String[] parts = new String(data).split(";");
>     if (parts.length == 3)
>        reading inprogress metadata
>     else if (parts.length == 4)
>        reading inprogress metadata
>     else
>       throw new IOException("Invalid ledger entry, "
>                             + new String(data));
> }
> {noformat}
> Scenario: how can a bad inProgress_000X node be left behind?
> Assume BKJM has created the inProgress_000X znode and ZK is not available when trying to add the metadata. Now inProgress_000X ends up with partial information.
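The failure mode in the issue description above (a znode whose payload is empty or only partially written) can be illustrated with a small, self-contained sketch. This is not the BKJM code: `LedgerMetadataSketch`, its `parse` method, and the `Optional` return value are hypothetical names chosen for illustration. Only the `;`-separated layout and the three in-progress fields (version, ledgerId, firstTxId) come from the description; treating a fourth field as a last txid is an assumption. The sketch reports a malformed entry as "absent" instead of throwing, so a caller could decide to skip or clean up a bad inProgress znode rather than abort startup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical stand-in for the metadata stored in a ledger znode. */
final class LedgerMetadataSketch {
  final int version;
  final long ledgerId;
  final long firstTxId;
  final long lastTxId; // -1 when only three fields are present (in-progress)

  private LedgerMetadataSketch(int version, long ledgerId,
                               long firstTxId, long lastTxId) {
    this.version = version;
    this.ledgerId = ledgerId;
    this.firstTxId = firstTxId;
    this.lastTxId = lastTxId;
  }

  /**
   * Parses "version;ledgerId;firstTxId[;lastTxId]".
   * Returns Optional.empty() for empty, truncated, or non-numeric data
   * instead of throwing, so the caller chooses how to recover.
   */
  static Optional<LedgerMetadataSketch> parse(byte[] data) {
    if (data == null || data.length == 0) {
      return Optional.empty(); // znode created, metadata never written
    }
    String[] parts = new String(data, StandardCharsets.UTF_8).split(";");
    if (parts.length != 3 && parts.length != 4) {
      return Optional.empty(); // partial information
    }
    try {
      int version = Integer.parseInt(parts[0]);
      long ledgerId = Long.parseLong(parts[1]);
      long firstTxId = Long.parseLong(parts[2]);
      // Assumption: a fourth field, when present, is the last txid.
      long lastTxId = parts.length == 4 ? Long.parseLong(parts[3]) : -1L;
      return Optional.of(
          new LedgerMetadataSketch(version, ledgerId, firstTxId, lastTxId));
    } catch (NumberFormatException e) {
      return Optional.empty(); // garbled field
    }
  }
}
```

A recovery loop built on this would check `isPresent()` per znode and log and skip the bad ones. Whether skipping is safe depends on how the rest of the journal manager tracks transaction ids, which this sketch does not model.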
[jira] [Commented] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.
    [ https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277724#comment-13277724 ]

Vinay commented on HDFS-3436:
-----------------------------

The scenario is as follows:
---------------------------
1. The cluster has 4 DNs.
2. A file is written to 3 DNs, DN1->DN2->DN3, with a genstamp of 1001.
3. DN3 is stopped.
4. Append is called.
5. For this append, the client tries to create the pipeline DN1->DN2->DN3.
   During this time the following happens:
   1. The generation stamp is updated in volumeMap to 1002.
   2. The datanode tries to connect to the next DN in the pipeline. If the next DN in the pipeline is down, an exception is thrown and the client tries to re-form the pipeline.
   Since DN3 is down, the genstamp on DN1 and DN2 has already been updated to 1002, but the client does not know about this.
6. The client now tries to add one more datanode, DN4, to the append pipeline, and asks DN1 or DN2 to transfer the block to DN4. But the client asks for the block with genstamp 1001.
7. Since DN1 and DN2 don't have a block with genstamp 1001, the transfer fails and the client write fails as well.

Proposed solution
-----------------
In DataXceiver#writeBlock(), if we create the mirror connection before creating the BlockReceiver instance, this solves the problem.

> Append to file is failing when one of the datanode where the block present is down.
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-3436
>                 URL: https://issues.apache.org/jira/browse/HDFS-3436
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 2.0.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Vinay
>
> Scenario:
> =========
> 1. Cluster with 4 DataNodes.
> 2. Written file to 3 DNs, DN1->DN2->DN3
> 3. Stopped DN3,
> Now Append to file is failing due to addDatanode2ExistingPipeline is failed.
> *CLinet Trace* > {noformat} > 2012-04-24 22:06:09,947 INFO hdfs.DFSClient > (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in > createBlockOutputStream > java.io.IOException: Bad connect ack with firstBadLink as ***:50010 > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > 2012-04-24 22:06:09,947 WARN hdfs.DFSClient > (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery > for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 > in pipeline *:50010, **:50010, *:50010: bad datanode **:50010 > 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:run(549)) > - DataStreamer Exception > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > 2012-04-24 22:06:10,072 WARN hdfs.DFSClient > (DFSOutputStream.java:hflush(1515)) - Error while syncing > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > {noformat} > *DataNode Trace* > {no
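The generation-stamp mismatch described in Vinay's comment on this issue can be reproduced in miniature. The sketch below is a toy model, not HDFS code: `ReplicaSketch`, `beginAppend`, `beginAppendMirrorFirst`, and `transfer` are hypothetical names, and the genstamps 1001 and 1002 are taken from the scenario in the comment. It shows a replica whose genstamp is bumped before the mirror connection is attempted, after which a transfer request carrying the old genstamp is rejected. The second method models the ordering proposed in the comment, where the mirror is contacted first.

```java
import java.io.IOException;

/** Toy model of one datanode's replica; all names here are hypothetical. */
final class ReplicaSketch {
  private long genStamp;

  ReplicaSketch(long genStamp) {
    this.genStamp = genStamp;
  }

  long genStamp() {
    return genStamp;
  }

  /** Ordering described in the report: bump the genstamp, then connect. */
  void beginAppend(long newGenStamp, boolean mirrorReachable)
      throws IOException {
    genStamp = newGenStamp; // replica state changes first
    if (!mirrorReachable) {
      // The client sees a failure but never learns about the bump.
      throw new IOException("next datanode in pipeline is down");
    }
  }

  /** Ordering proposed in the comment: connect, then bump the genstamp. */
  void beginAppendMirrorFirst(long newGenStamp, boolean mirrorReachable)
      throws IOException {
    if (!mirrorReachable) {
      throw new IOException("next datanode in pipeline is down");
    }
    genStamp = newGenStamp; // only reached when the pipeline is intact
  }

  /** Transfer succeeds only if the caller names the current genstamp. */
  void transfer(long requestedGenStamp) throws IOException {
    if (requestedGenStamp != genStamp) {
      throw new IOException("replica has genstamp " + genStamp
          + ", requested " + requestedGenStamp);
    }
  }
}
```

With `beginAppend(1002, false)` the replica ends up at 1002 and a later `transfer(1001)` fails, matching steps 5 to 7 of the comment. With `beginAppendMirrorFirst(1002, false)` the replica stays at 1001 and the same transfer is accepted. The model ignores everything else a real pipeline setup does, so it only illustrates the ordering argument.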
[jira] [Updated] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.
[ https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinay updated HDFS-3436: Description: Scenario: = 1. Cluster with 4 DataNodes. 2. Written file to 3 DNs, DN1->DN2->DN3 3. Stopped DN3, Now Append to file is failing due to addDatanode2ExistingPipeline is failed. *CLinet Trace* {noformat} 2012-04-24 22:06:09,947 INFO hdfs.DFSClient (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as ***:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) 2012-04-24 22:06:09,947 WARN hdfs.DFSClient (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 in pipeline *:50010, **:50010, *:50010: bad datanode **:50010 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:run(549)) - DataStreamer Exception java.io.EOFException: Premature EOF: no length prefix available at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:hflush(1515)) - Error while syncing java.io.EOFException: Premature EOF: no length prefix available at 
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) java.io.EOFException: Premature EOF: no length prefix available at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) {noformat} *DataNode Trace* {noformat} 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation src: /127.0.0.1:49811 dest: /127.0.0.1:49744 java.io.IOException: BP-2001850558-xx.xx.xx.xx-1337249347060:blk_-8165642083860293107_1002 is neither a RBW nor a Finalized, r=ReplicaBeingWritten, blk_-8165642083860293107_1003, RBW getNumBytes() = 1024 getBytesOnDisk() = 1024 getVisibleLength()= 1024 getVolume() = E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current getBlockFile()= E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-xx.xx.xx.xx-1337249347060\current\rbw\blk_-8165642083860293107 bytesAcked=1024 bytesOnDisk=102 at org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525) at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:114) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:78) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:189) at java.lang.Thread.run(Unknown Source) {noformat} was: Scenario: = 1. Cluster with 4 DataNodes. 2. Written file to 3 DNs, DN1->DN2->DN3 3. Stopped DN3, Now Append to file is failing due to addDatanode2ExistingPipeline is failed. *CLinet Trace* {noformat} 2012-04-24 22:06:09,947 INFO hdfs.DFSClient (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as **
[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled
[ https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277715#comment-13277715 ] Hudson commented on HDFS-3433: -- Integrated in Hadoop-Common-trunk-Commit #2258 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2258/]) HDFS-3433. GetImageServlet should allow administrative requestors when security is enabled. Contributed by Aaron T. Myers. (Revision 1339540) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java > GetImageServlet should allow administrative requestors when security is > enabled > --- > > Key: HDFS-3433 > URL: https://issues.apache.org/jira/browse/HDFS-3433 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Aaron T. Myers >Assignee: Aaron T. Myers > Fix For: 2.0.1 > > Attachments: HDFS-3433.patch > > > Currently the GetImageServlet only allows the NN and checkpointing nodes to > connect. Since we now have the fetchImage command in DFSAdmin, we should also > allow administrative requests as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.
[ https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinay reassigned HDFS-3436: --- Assignee: Vinay > Append to file is failing when one of the datanode where the block present is > down. > --- > > Key: HDFS-3436 > URL: https://issues.apache.org/jira/browse/HDFS-3436 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 2.0.0 >Reporter: Brahma Reddy Battula >Assignee: Vinay > > Scenario: > = > 1. Cluster with 4 DataNodes. > 2. Written file to 3 DNs, DN1->DN2->DN3 > 3. Stopped DN3, > Now Append to file is failing due to addDatanode2ExistingPipeline is failed. > *CLinet Trace* > {noformat} > 2012-04-24 22:06:09,947 INFO hdfs.DFSClient > (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in > createBlockOutputStream > java.io.IOException: Bad connect ack with firstBadLink as ***:50010 > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > 2012-04-24 22:06:09,947 WARN hdfs.DFSClient > (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery > for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 > in pipeline *:50010, **:50010, *:50010: bad datanode **:50010 > 2012-04-24 22:06:10,072 WARN hdfs.DFSClient (DFSOutputStream.java:run(549)) > - DataStreamer Exception > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > 2012-04-24 22:06:10,072 WARN hdfs.DFSClient > (DFSOutputStream.java:hflush(1515)) - Error while syncing > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461) > {noformat} > *DataNode Trace* > {noformat} > 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - > host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation > src: /127.0.0.1:49811 dest: /127.0.0.1:49744 > java.io.IOException: > BP-2001850558-10.18.47.190-1337249347060:blk_-8165642083860293107_1002 is > neither a RBW nor a Finalized, r=ReplicaBeingWritten, > blk_-8165642083860293107_1003, RBW > getNumBytes() = 1024 > getBytesOnDisk() = 1024 > getVisibleLength()= 1024 > getVolume() = > 
E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current > getBlockFile()= > E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-10.18.47.190-1337249347060\current\rbw\blk_-8165642083860293107 > bytesAcked=1024 > bytesOnDisk=102 > at > org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:114) > at > org.apache.hadoop.hdfs.protocol.d
[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled
[ https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277713#comment-13277713 ] Hudson commented on HDFS-3433: -- Integrated in Hadoop-Hdfs-trunk-Commit #2332 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2332/]) HDFS-3433. GetImageServlet should allow administrative requestors when security is enabled. Contributed by Aaron T. Myers. (Revision 1339540) Result = SUCCESS atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java > GetImageServlet should allow administrative requestors when security is > enabled > --- > > Key: HDFS-3433 > URL: https://issues.apache.org/jira/browse/HDFS-3433 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node >Affects Versions: 2.0.0 >Reporter: Aaron T. Myers >Assignee: Aaron T. Myers > Fix For: 2.0.1 > > Attachments: HDFS-3433.patch > > > Currently the GetImageServlet only allows the NN and checkpointing nodes to > connect. Since we now have the fetchImage command in DFSAdmin, we should > allow administrative requests as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
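The access rule described in HDFS-3433 (NN and checkpointing nodes were already allowed, administrators are now allowed too) can be illustrated with a minimal sketch. This is not the code from HDFS-3433.patch: the class name ImageRequestAuthSketch, the method isAllowed, its parameters, and the example principals are all hypothetical, and the real servlet may determine the requestor and the admin list quite differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical illustration of the rule in HDFS-3433.
 * None of these names are taken from the real GetImageServlet.
 */
final class ImageRequestAuthSketch {

    /**
     * Returns true when the authenticated requestor may use the image servlet.
     *
     * @param remoteUser       authenticated user or principal, or null if unauthenticated
     * @param daemonPrincipals principals of the NN and checkpointing nodes (assumed input)
     * @param adminUsers       users configured as cluster administrators (assumed input)
     */
    static boolean isAllowed(String remoteUser,
                             Set<String> daemonPrincipals,
                             Set<String> adminUsers) {
        if (remoteUser == null) {
            return false; // unauthenticated requests are rejected
        }
        // Rule before the change: only the NN and checkpointing nodes pass.
        if (daemonPrincipals.contains(remoteUser)) {
            return true;
        }
        // Rule added by the change: administrative requestors pass as well.
        return adminUsers.contains(remoteUser);
    }

    public static void main(String[] args) {
        Set<String> daemons = new HashSet<>(Arrays.asList(
                "nn/nn1.example.com@EXAMPLE.COM",
                "nn/snn1.example.com@EXAMPLE.COM"));
        Set<String> admins = new HashSet<>(Arrays.asList("hdfsadmin"));

        System.out.println(isAllowed("nn/snn1.example.com@EXAMPLE.COM", daemons, admins));
        System.out.println(isAllowed("hdfsadmin", daemons, admins));
        System.out.println(isAllowed("alice", daemons, admins));
    }
}
```

In this sketch an ordinary user such as "alice" is still rejected, which matches the issue's intent of widening access only to administrators (for example, someone running the fetchImage command in DFSAdmin).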