[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278619#comment-13278619
 ] 

Colin Patrick McCabe commented on HDFS-2982:


bq. The javadoc for JournalSet#selectInputStreams is a little over-simplified 
=) - how about describing the algorithm (get the streams starting with fromTxid 
from all managers, return a list sorted by the starting txid etc)

Ok, will add.

bq. In EditLogFileInputStream#init why only close the stream that threw?

Yeah, I guess closing an already closed stream should be idempotent, at least 
if they're correctly implementing the Closeable interface.

bq. In TestEditLog readAllEdits is dead code

ok

bq. How about describing the high-level approach in the patch?

From a high level, this patch is about getting rid of two APIs in 
JournalManager (getNumberOfTransactions and getInputStream) and adding one 
API, selectInputStreams.  The new API simply gathers up all the available 
streams in one go and puts them into a Collection.  This is more efficient, 
and also better for some of the changes we'd like to make in the future, like 
supporting overlapping edit log streams.
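
A minimal sketch of the shape being described here, with stand-in Stream and 
Journal types in place of EditLogInputStream and JournalManager; the names and 
signatures are illustrative only, not copied from HDFS-2982.001.patch.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SelectInputStreamsSketch {
  /** Stand-in for EditLogInputStream; all we need here is the starting txid. */
  static class Stream {
    final long firstTxId;
    Stream(long firstTxId) { this.firstTxId = firstTxId; }
  }

  /** Stand-in for a single JournalManager. */
  interface Journal {
    /** Add every available stream starting at or after fromTxId, in one call. */
    void selectInputStreams(Collection<Stream> out, long fromTxId) throws IOException;
  }

  /** JournalSet-style aggregation: ask each journal once, then sort by start txid. */
  static List<Stream> selectInputStreams(List<Journal> journals, long fromTxId)
      throws IOException {
    List<Stream> all = new ArrayList<Stream>();
    for (Journal j : journals) {
      // One call per journal replaces the old per-segment
      // getNumberOfTransactions()/getInputStream() round trips.
      j.selectInputStreams(all, fromTxId);
    }
    Collections.sort(all, new Comparator<Stream>() {
      public int compare(Stream a, Stream b) {
        return Long.compare(a.firstTxId, b.firstTxId);
      }
    });
    return all;
  }
}
{code}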

Edit log validation is the process of finding out how far in-progress edit logs 
go.  We do it during edit log finalization so that we can find out what to 
rename the in-progress edit log file to.  ("validation" might not be a great 
name for this process, but it's probably too late to change it now.)  We don't 
validate finalized logs.

There are some minor changes to validation here, and a major change.

First, the minor changes.  One change is to have the validation class contain 
only the end txid, rather than the start txid, number of txids, and end txid.  
The start txid is already known, and the number of txids does not represent 
what you might think, but merely end - start + 1.  So it's good to get rid of 
that cruft.  Another minor change is that EditLogValidation#corruptionDetected 
was renamed to EditLogValidation#hasCorruptHeader.  That is the concept it 
always represented-- it never referred to anything other than header 
corruption, and the rest of the code even uses the terminology hasCorruptHeader 
to represent this info (see EditLogFile#hasCorruptHeader).  So I'm just trying 
to be consistent.
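
As a rough sketch, the slimmed-down validation result would carry just those 
two pieces of information (field and accessor names below are illustrative, 
not copied from the patch):

{code}
/** Hypothetical sketch of the simplified validation result. */
class EditLogValidation {
  private final long endTxId;              // last txid found in the log
  private final boolean hasCorruptHeader;  // renamed from corruptionDetected

  EditLogValidation(long endTxId, boolean hasCorruptHeader) {
    this.endTxId = endTxId;
    this.hasCorruptHeader = hasCorruptHeader;
  }

  long getEndTxId() { return endTxId; }
  boolean hasCorruptHeader() { return hasCorruptHeader; }

  // No startTxId or numTransactions fields: the caller already knows the
  // start, and the count was always just endTxId - startTxId + 1.
}
{code}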

The major change is that we now read to the end of a corrupt file in 
validation, finding the true end transaction rather than merely the first 
unreadable txid.  This is needed for recovery to work properly on these files.  
It's possible that this change could be dropped from this patch.  Conceptually, 
it's more related to HDFS-3049.

> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.
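
The pattern being reported, and the obvious alternative, look roughly like the 
hypothetical sketch below; this is simplified stand-in code, not the actual 
findMaxTransaction implementation.

{code}
import java.io.File;

class FindMaxTxIdSketch {
  // Problematic shape: one directory listing per segment, i.e.
  // O(segments * files) stat traffic, which is painful over NFS.
  static long perSegmentListing(File editsDir, long[] segmentStartTxIds) {
    long max = 0;
    for (long start : segmentStartTxIds) {
      for (File f : editsDir.listFiles()) {     // repeated scan of the same dir
        max = Math.max(max, txIdFromName(f.getName(), start));
      }
    }
    return max;
  }

  // Cheaper shape: list the directory once and reuse the listing.
  static long singleListing(File editsDir, long[] segmentStartTxIds) {
    File[] files = editsDir.listFiles();        // one scan total
    long max = 0;
    for (long start : segmentStartTxIds) {
      for (File f : files) {
        max = Math.max(max, txIdFromName(f.getName(), start));
      }
    }
    return max;
  }

  // Placeholder: the real code parses txids out of edits_N-M file names.
  private static long txIdFromName(String name, long fallback) {
    return fallback;
  }
}
{code}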

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-17 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278617#comment-13278617
 ] 

Rakesh R commented on HDFS-3441:


Yeah, I have seen a race condition between the purgeLogsOlderThan() by the 
Standby and the finalizeLogSegment() by the Active.

Cause: the following sequence of operations happens:
1) When the Standby comes to purge, it reads the full list of ledger logs, 
including the inprogress_72 file.
2) Meanwhile, the Active NN finalizes the log segment inprogress_72 and 
creates a new inprogress_74.
3) The Standby then reads the data of inprogress_72 to decide whether it is in 
progress, and hits a NoNodeException.


I feel the filtering of in-progress files could be done based on the file name 
itself, instead of reading the content and filtering based on the data, like 
this:

BookKeeperJournalManager.java
{noformat}
try {
  List<String> ledgerNames = zkc.getChildren(ledgerPath, false);
  for (String ledgerName : ledgerNames) {
    // Filter in-progress ledgers by name instead of reading the znode, so a
    // concurrent finalization on the active NN cannot cause a NoNodeException.
    if (!inProgressOk && ledgerName.contains("inprogress")) {
      continue;
    }
    ledgers.add(EditLogLedgerMetadata.read(zkc, ledgerPath + "/" + ledgerName));
  }
} catch (Exception e) {
  throw new IOException("Exception reading ledger list from zk", e);
}
{noformat}


> Race condition between rolling logs at active NN and purging at standby
> ---
>
> Key: HDFS-3441
> URL: https://issues.apache.org/jira/browse/HDFS-3441
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: suja s
>
> Standby NN has the ledger list with all files, including the in-progress 
> file (say inprogress_val1).
> The Active NN has since done finalization and created a new in-progress file.
> When the Standby proceeds further, it finds that the in-progress file it had 
> in the list is no longer present, and the NN gets shut down.
> NN Logs
> =
> 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Image file of size 201 saved in 0 seconds.
> 2012-05-17 22:15:03,874 INFO 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll 
> on remote NameNode /xx.xx.xx.102:8020
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
> retain 2 images with txid >= 111
> 2012-05-17 22:15:03,923 INFO 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
> image 
> FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
>  cpktTxId=109)
> 2012-05-17 22:15:03,961 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 
> failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
>  stream=null))
> java.io.IOException: Exception reading ledger list from zk
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
>   at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$Checkpoint

[jira] [Updated] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-17 Thread suja s (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

suja s updated HDFS-3441:
-

Description: 
Standby NN has the ledger list with all files, including the in-progress file 
(say inprogress_val1).
The Active NN has since done finalization and created a new in-progress file.
When the Standby proceeds further, it finds that the in-progress file it had 
in the list is no longer present, and the NN gets shut down.


NN Logs
=
2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Image file of size 201 saved in 0 seconds.
2012-05-17 22:15:03,874 INFO 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on 
remote NameNode /xx.xx.xx.102:8020
2012-05-17 22:15:03,923 INFO 
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
retain 2 images with txid >= 111
2012-05-17 22:15:03,923 INFO 
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
image 
FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
 cpktTxId=109)
2012-05-17 22:15:03,961 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Error: purgeLogsOlderThan 0 failed for required journal 
(JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
 stream=null))
java.io.IOException: Exception reading ledger list from zk
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
at 
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
at 
org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
... 16 more
2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
SHUTDOWN_MSG: 


ZK Data


[zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
-40;59;116
cZxid = 0x2be
ctime = Thu May 17 22:15:03 IST 2012
mZxid = 0x2be
mtime = Thu May 17 22:15:03 IST 2012
pZxid = 0x2be
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 10
numChildren = 0

  was:
Standby NN has got the ledgerlist with list of all files, including the 
inprogress file (with say inprogress_val1)
Active NN has done finalization and created new inprogress file.
Standby when proceeds further finds that the inprogress file which it had in 
the list is not present and NN gets shutdown


NN Logs
=
2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Image file of size 201 saved in 0 seconds.
2012-05-17 22:15:03,874 INFO 
org.apache.hadoop.hdfs.

[jira] [Commented] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.

2012-05-17 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278613#comment-13278613
 ] 

Vinay commented on HDFS-3436:
-

Thanks Nicholas, that works. I will upload a patch for that.

> Append to file is failing when one of the datanode where the block present is 
> down.
> ---
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 2.0.0
>Reporter: Brahma Reddy Battula
>Assignee: Vinay
>
> Scenario:
> =
> 1. Cluster with 4 DataNodes.
> 2. Written file to 3 DNs, DN1->DN2->DN3
> 3. Stopped DN3,
> Now appending to the file fails because addDatanode2ExistingPipeline failed.
>  *Client Trace* 
> {noformat}
> 2012-04-24 22:06:09,947 INFO  hdfs.DFSClient 
> (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in 
> createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery 
> for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 
> in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient (DFSOutputStream.java:run(549)) 
> - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:hflush(1515)) - Error while syncing
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> {noformat}
>  *DataNode Trace*  
> {noformat}
> 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - 
> host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation  
> src: /127.0.0.1:49811 dest: /127.0.0.1:49744
> java.io.IOException: 
> BP-2001850558-xx.xx.xx.xx-1337249347060:blk_-8165642083860293107_1002 is 
> neither a RBW nor a Finalized, r=ReplicaBeingWritten, 
> blk_-8165642083860293107_1003, RBW
>   getNumBytes() = 1024
>   getBytesOnDisk()  = 1024
>   getVisibleLength()= 1024
>   getVolume()   = 
> E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current
>   getBlockFile()= 
> E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-xx.xx.xx.xx-1337249347060\current\rbw\blk_-8165642083860293107
>   bytesAcked=1024
>   bytesOnDisk=102
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525)
>   at 
> org.apache.hadoop.hdfs.protocol.da

[jira] [Work started] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.

2012-05-17 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-3436 started by Vinay.

> Append to file is failing when one of the datanode where the block present is 
> down.
> ---
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 2.0.0
>Reporter: Brahma Reddy Battula
>Assignee: Vinay
>
> Scenario:
> =
> 1. Cluster with 4 DataNodes.
> 2. Written file to 3 DNs, DN1->DN2->DN3
> 3. Stopped DN3,
> Now appending to the file fails because addDatanode2ExistingPipeline failed.
>  *Client Trace* 
> {noformat}
> 2012-04-24 22:06:09,947 INFO  hdfs.DFSClient 
> (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in 
> createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery 
> for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 
> in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient (DFSOutputStream.java:run(549)) 
> - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:hflush(1515)) - Error while syncing
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> {noformat}
>  *DataNode Trace*  
> {noformat}
> 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - 
> host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation  
> src: /127.0.0.1:49811 dest: /127.0.0.1:49744
> java.io.IOException: 
> BP-2001850558-xx.xx.xx.xx-1337249347060:blk_-8165642083860293107_1002 is 
> neither a RBW nor a Finalized, r=ReplicaBeingWritten, 
> blk_-8165642083860293107_1003, RBW
>   getNumBytes() = 1024
>   getBytesOnDisk()  = 1024
>   getVisibleLength()= 1024
>   getVolume()   = 
> E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current
>   getBlockFile()= 
> E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-xx.xx.xx.xx-1337249347060\current\rbw\blk_-8165642083860293107
>   bytesAcked=1024
>   bytesOnDisk=102
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:114)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:78

[jira] [Created] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby

2012-05-17 Thread suja s (JIRA)
suja s created HDFS-3441:


 Summary: Race condition between rolling logs at active NN and 
purging at standby
 Key: HDFS-3441
 URL: https://issues.apache.org/jira/browse/HDFS-3441
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: suja s


Standby NN has the ledger list with all files, including the in-progress file 
(say inprogress_val1).
The Active NN has since done finalization and created a new in-progress file.
When the Standby proceeds further, it finds that the in-progress file it had 
in the list is no longer present, and the NN gets shut down.


NN Logs
=
2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Image file of size 201 saved in 0 seconds.
2012-05-17 22:15:03,874 INFO 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on 
remote NameNode /10.18.40.102:8020
2012-05-17 22:15:03,923 INFO 
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to 
retain 2 images with txid >= 111
2012-05-17 22:15:03,923 INFO 
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old 
image 
FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_109,
 cpktTxId=109)
2012-05-17 22:15:03,961 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Error: purgeLogsOlderThan 0 failed for required journal 
(JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767,
 stream=null))
java.io.IOException: Exception reading ledger list from zk
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
at 
org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
at 
org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
at 
org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
... 16 more
2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
SHUTDOWN_MSG: 


ZK Data


[zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
-40;59;116
cZxid = 0x2be
ctime = Thu May 17 22:15:03 IST 2012
mZxid = 0x2be
mtime = Thu May 17 22:15:03 IST 2012
pZxid = 0x2be
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 10
numChildren = 0

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278610#comment-13278610
 ] 

Colin Patrick McCabe commented on HDFS-2982:


There are lots and lots of unit tests that would have to change if 
EditLogInputStream started requiring an init() call, not to mention the subtle 
bugs that might crop up.  That alone would almost be worth its own patch.  
Let's deal with this later if we decide it's something worth doing.  Frankly, 
I would argue against it because I think there are better APIs we could 
design.  In particular, an API which separates the concept of a stream from 
the concept of a stream location is much more efficient and results in cleaner 
code, because the invariant that you can't use something without initializing 
it is then enforced by the type system.  So basically, can we revisit this 
idea later, as in after this week?
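
A hypothetical sketch of the kind of split being described; these interfaces 
are illustrative and are not part of the patch.

{code}
import java.io.Closeable;
import java.io.IOException;

/** Cheap, immutable description of where an edit log stream can be found. */
interface EditLogLocation {
  long getFirstTxId();
  // Opening the location is the only way to get a readable stream, so
  // "initialized before use" is enforced by the type system.
  EditLogStreamSketch open() throws IOException;
}

/** The actual reader, only ever obtained via EditLogLocation.open(). */
interface EditLogStreamSketch extends Closeable {
  long readNextTxId() throws IOException;  // placeholder read method
}
{code}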

bq. The new test case is missing the @Test annotation so it won't run.

Will fix.

bq. Are the changes to validateEditLog necessary here? And the change to how 
corrupt files are handled?

It's often really time consuming to change these things because then I have to 
redo all the unit tests.  Still, I will take a look at it.

> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278609#comment-13278609
 ] 

Hudson commented on HDFS-3440:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2286 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2286/])
HDFS-3440. More effectively limit stream memory consumption when reading 
corrupt edit logs. Contributed by Colin Patrick McCabe. (Revision 1339978)

 Result = FAILURE
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339978
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperEditLogInputStream.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogBackupInputStream.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/StreamLimiter.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSEditLogLoader.java


> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface, that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.
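
The idea can be sketched as a thin wrapper that enforces a per-opcode byte 
budget; the code below is illustrative only and is not the StreamLimiter 
interface that the committed patch adds.

{code}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: fail cleanly once an opcode consumes too many bytes. */
class OpSizeLimitedInputStream extends FilterInputStream {
  private long remaining = Long.MAX_VALUE;

  OpSizeLimitedInputStream(InputStream in) { super(in); }

  /** Call before decoding each opcode to (re)arm the byte budget. */
  void setLimit(long maxOpBytes) { this.remaining = maxOpBytes; }

  @Override
  public int read() throws IOException {
    checkBudget();
    int b = in.read();
    if (b >= 0) remaining--;
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    checkBudget();
    int n = in.read(buf, off, (int) Math.min(len, remaining));
    if (n > 0) remaining -= n;
    return n;
  }

  private void checkBudget() throws IOException {
    if (remaining <= 0) {
      throw new IOException("Exceeded per-opcode read limit; edit log is likely corrupt");
    }
  }
}
{code}

The committed change wires this kind of check into the opcode reader through a 
StreamLimiter interface (see the file list in the commit above), though the 
exact methods there may differ from this sketch.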

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278591#comment-13278591
 ] 

Todd Lipcon commented on HDFS-2982:
---

Hi Colin. Many of my comments from HDFS-3049 still apply (e.g. about the lazy 
initialization of the reader stream).

Are the changes to validateEditLog necessary here? And the change to how 
corrupt files are handled? It seems like they fit more appropriately into 
HDFS-3049. I think you should be able to separate those out from this 
performance fix.

The new test case is missing the @Test annotation so it won't run.



> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-744) Support hsync in HDFS

2012-05-17 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HDFS-744:
---

Attachment: HDFS-744-trunk-v5.patch

New version of the patch.

* Implemented all of Nicholas' suggestions.
* Added some simple tests.
* Added a flushFS() method to SequenceFile.Writer.

I would still prefer to implement hsync() as flushOrSync(syncBlock) rather 
than flushOrSync(true), but this works too.

> Support hsync in HDFS
> -
>
> Key: HDFS-744
> URL: https://issues.apache.org/jira/browse/HDFS-744
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: data-node, hdfs client
>Reporter: Hairong Kuang
>Assignee: Lars Hofhansl
> Attachments: HDFS-744-trunk-v2.patch, HDFS-744-trunk-v3.patch, 
> HDFS-744-trunk-v4.patch, HDFS-744-trunk-v5.patch, HDFS-744-trunk.patch, 
> hdfs-744-v2.txt, hdfs-744-v3.txt, hdfs-744.txt
>
>
> HDFS-731 implements hsync by default as hflush. As described in HADOOP-6313, 
> the real expected semantics should be "flushes out to all replicas and all 
> replicas have done posix fsync equivalent - ie the OS has flushed it to the 
> disk device (but the disk may have it in its cache)." This jira aims to 
> implement the expected behaviour.
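
A minimal usage sketch of that contract, assuming a running HDFS and the 
Hadoop 2.x FSDataOutputStream API; hsync() only gains the stronger durability 
semantics once this change is in.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HsyncDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/hsync-demo"));
    try {
      out.writeBytes("important record\n");
      out.hflush();  // data pushed to every datanode in the pipeline
      out.hsync();   // with HDFS-744: every replica has also been fsync'd
    } finally {
      out.close();
    }
  }
}
{code}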

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-3440:
--

  Resolution: Fixed
   Fix Version/s: 2.0.1
Target Version/s: 2.0.1
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface, that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278582#comment-13278582
 ] 

Eli Collins commented on HDFS-2982:
---

Hey Colin,

Took a quick look. How about describing the high-level approach in the patch?
- The javadoc for JournalSet#selectInputStreams is a little over-simplified =) 
- how about describing the algorithm (get the streams starting with fromTxid 
from all managers, return a list sorted by the starting txid etc) 
- In EditLogFileInputStream#init why only close the stream that threw?
- In TestEditLog readAllEdits is dead code



> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278581#comment-13278581
 ] 

Todd Lipcon commented on HDFS-3440:
---

+1, looks good to me. Will commit this momentarily.

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface, that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278578#comment-13278578
 ] 

Hadoop QA commented on HDFS-3440:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527995/HDFS-3440.002.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 2 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2469//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2469//console

This message is automatically generated.

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface, that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278575#comment-13278575
 ] 

Hadoop QA commented on HDFS-2982:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527992/HDFS-2982.001.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2468//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2468//console

This message is automatically generated.

> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278574#comment-13278574
 ] 

Hadoop QA commented on HDFS-3049:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527991/HDFS-3049.021.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2467//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2467//console

This message is automatically generated.

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch, HDFS-3049.021.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins reassigned HDFS-2982:
-

Assignee: Colin Patrick McCabe  (was: Todd Lipcon)

> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-2982:
--

 Target Version/s: 2.0.1  (was: HA branch (HDFS-1623), 0.24.0)
Affects Version/s: (was: 0.24.0)
   2.0.0

> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3440:
---

Attachment: HDFS-3440.002.patch

* add test

* address todd's comments

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch, HDFS-3440.002.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface, that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278547#comment-13278547
 ] 

Colin Patrick McCabe commented on HDFS-3049:


FYI: I'm posting the patch for the startup performance stuff over at HDFS-2982.

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch, HDFS-3049.021.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-2982:
---

Attachment: HDFS-2982.001.patch

> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.24.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2982) Startup performance suffers when there are many edit log segments

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-2982:
---

Status: Patch Available  (was: Open)

> Startup performance suffers when there are many edit log segments
> -
>
> Key: HDFS-2982
> URL: https://issues.apache.org/jira/browse/HDFS-2982
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.24.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: HDFS-2982.001.patch
>
>
> For every one of the edit log segments, it seems like we are calling 
> listFiles on the edit log directory inside of {{findMaxTransaction}}. This is 
> killing performance, especially when there are many log segments and the 
> directory is stored on NFS. It is taking several minutes to start up the NN 
> when there are several thousand log segments present.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3049:
---

Attachment: HDFS-3049.021.patch

Smaller patch which strips out RedundantEditLogStream and StreamLimiter.

Fixed some comments and addressed Todd's comments.  Many Log.info messages were 
changed to debug level, etc.

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch, HDFS-3049.021.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-744) Support hsync in HDFS

2012-05-17 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278544#comment-13278544
 ] 

Lars Hofhansl commented on HDFS-744:


Alternatively we could add flushFS() to SequenceFile.Writer.

> Support hsync in HDFS
> -
>
> Key: HDFS-744
> URL: https://issues.apache.org/jira/browse/HDFS-744
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: data-node, hdfs client
>Reporter: Hairong Kuang
>Assignee: Lars Hofhansl
> Attachments: HDFS-744-trunk-v2.patch, HDFS-744-trunk-v3.patch, 
> HDFS-744-trunk-v4.patch, HDFS-744-trunk.patch, hdfs-744-v2.txt, 
> hdfs-744-v3.txt, hdfs-744.txt
>
>
> HDFS-731 implements hsync by default as hflush. As described in HADOOP-6313, 
> the real expected semantics should be "flushes out to all replicas and all 
> replicas have done posix fsync equivalent - ie the OS has flushed it to the 
> disk device (but the disk may have it in its cache)." This jira aims to 
> implement the expected behaviour.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278505#comment-13278505
 ] 

Hadoop QA commented on HDFS-3415:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527967/HDFS-3415.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2466//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2466//console

This message is automatically generated.

> NameNode is getting shutdown by throwing nullpointer exception when one of 
> the layout version is different with others(Multiple storage dirs are 
> configured)
> 
>
> Key: HDFS-3415
> URL: https://issues.apache.org/jira/browse/HDFS-3415
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0, 3.0.0
> Environment: Suse linux + jdk 1.6
>Reporter: Brahma Reddy Battula
>Assignee: Brandon Li
> Attachments: HDFS-3415.patch
>
>
> Scenario:
> =
> Start the namenode and datanode, configuring three storage dirs for the namenode.
> Write 10 files.
> Edit the VERSION file of one of the storage dirs and set the layout version to 123, 
> which is different from the default (-40).
> Stop the namenode.
> Start the namenode.
> Then I get the following exception:
> {noformat}
> 2012-05-13 19:01:41,483 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:261)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:276)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:368)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:402)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:564)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:545)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1093)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1151)
> 2012-05-13 19:01:41,485 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278500#comment-13278500
 ] 

Hadoop QA commented on HDFS-3440:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527968/HDFS-3440.001.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2465//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2465//console

This message is automatically generated.

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-744) Support hsync in HDFS

2012-05-17 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278488#comment-13278488
 ] 

Lars Hofhansl commented on HDFS-744:


Thanks Nicholas!

I'll rename the flag and the flush method.

The reason for not calling flush(true) was two-fold:
# Code that currently uses hsync would suddenly get the new behavior. For 
example, HBase, which uses this via a SequenceFile.Writer, would have no option to 
disable it (unless we expose a new flag to Writer.syncFS).
# Without SYNC_BLOCK it makes little sense (or would at least set the wrong 
expectation that everything is sync'ed to disk); see the sketch below.

Sorry about the tabs; I had used my default Eclipse formatter.
I will look at the Append tests and add some new ones for hsync.
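
To make the opt-in model described above concrete, here is a small sketch. It assumes 
the patch adds CreateFlag.SYNC_BLOCK and that, under this proposal, hsync() without the 
flag keeps the current hflush-like behavior; the create() overload is written from memory 
and the final patch may behave differently.

{code}
import java.io.IOException;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class SyncBlockExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Existing callers (e.g. HBase via SequenceFile.Writer) are unchanged:
    // without SYNC_BLOCK, hsync() behaves like hflush() under this proposal.
    FSDataOutputStream plain = fs.create(new Path("/tmp/plain"));
    plain.write(1);
    plain.hsync();
    plain.close();

    // Callers that want fsync-to-disk semantics opt in at create time.
    FSDataOutputStream synced = fs.create(new Path("/tmp/synced"),
        FsPermission.getDefault(),
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE, CreateFlag.SYNC_BLOCK),
        4096, (short) 3, 128 * 1024 * 1024, null);
    synced.write(1);
    synced.hsync();  // with SYNC_BLOCK, the datanodes fsync the current block
    synced.close();
  }
}
{code}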


> Support hsync in HDFS
> -
>
> Key: HDFS-744
> URL: https://issues.apache.org/jira/browse/HDFS-744
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: data-node, hdfs client
>Reporter: Hairong Kuang
>Assignee: Lars Hofhansl
> Attachments: HDFS-744-trunk-v2.patch, HDFS-744-trunk-v3.patch, 
> HDFS-744-trunk-v4.patch, HDFS-744-trunk.patch, hdfs-744-v2.txt, 
> hdfs-744-v3.txt, hdfs-744.txt
>
>
> HDFS-731 implements hsync by default as hflush. As described in HADOOP-6313, 
> the real expected semantics should be "flushes out to all replicas and all 
> replicas have done posix fsync equivalent - ie the OS has flushed it to the 
> disk device (but the disk may have it in its cache)." This jira aims to 
> implement the expected behaviour.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278472#comment-13278472
 ] 

Todd Lipcon commented on HDFS-3440:
---

- StreamLimiter either needs to be package-private or marked with a private 
interface annotation.
- We don't generally mark interface methods as "abstract"; in fact, I didn't 
know that was legal Java.
- Can you refactor the code that checks curPos+len against the limit into a 
{{checkLimit(int bytesToRead);}} call? (See the sketch after this list.)
- It would be good to add a simple unit test of this functionality: e.g., construct 
an FSEditLogOp.Reader, give it a header that would cause it to try to read 
more than MAX_OP_SIZE, and verify that it throws the expected exception.
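
For reference, a rough sketch of the shape such a limiter and {{checkLimit}} call could 
take (the names follow the comments above; this is an illustration, not the attached patch):

{code}
import java.io.IOException;

/** Limits how far a reader may advance in the underlying stream. */
interface StreamLimiter {
  /** Set the limit, in bytes, relative to the current position. */
  void setLimit(long limit);

  /** Throw if reading bytesToRead more bytes would exceed the limit. */
  void checkLimit(int bytesToRead) throws IOException;
}

/** A simple limiter keyed off the reader's current position. */
class PositionTrackingLimiter implements StreamLimiter {
  private long curPos = 0;
  private long limitPos = Long.MAX_VALUE;

  @Override
  public void setLimit(long limit) {
    limitPos = curPos + limit;
  }

  @Override
  public void checkLimit(int bytesToRead) throws IOException {
    if (curPos + bytesToRead > limitPos) {
      // Fail cleanly instead of buffering an unbounded amount of bogus data.
      throw new IOException("Attempt to read " + bytesToRead
          + " byte(s) past the limit at position " + curPos);
    }
  }

  /** The reader calls this as it consumes bytes. */
  void advance(long bytesRead) {
    curPos += bytesRead;
  }
}
{code}

A unit test along the lines suggested would set a small limit and assert that checkLimit 
throws when asked to read an amount larger than the limit.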

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278460#comment-13278460
 ] 

Colin Patrick McCabe commented on HDFS-3440:


(the "failure to apply patch" thing is related to the earlier patch I posted 
and then took down)

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3330) If GetImageServlet throws an Error or RTE, response has HTTP "OK" status

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-3330:
--

  Resolution: Fixed
   Fix Version/s: 1.1.0
Target Version/s:   (was: 1.1.0, 2.0.0)
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Todd, nope. I committed to branch-1. Thanks!

> If GetImageServlet throws an Error or RTE, response has HTTP "OK" status
> 
>
> Key: HDFS-3330
> URL: https://issues.apache.org/jira/browse/HDFS-3330
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 1.0.0, 2.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: hdfs-3330.txt
>
>
> Currently in GetImageServlet, we catch Exception but not other Errors or 
> RTEs. So, if the code ends up throwing one of these exceptions, the 
> "response.sendError()" code doesn't run, but the finally clause does run. 
> This results in the servlet returning HTTP 200 OK and an empty response, 
> which causes the client to think it got a successful image transfer.
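
For illustration, the usual pattern that avoids this failure mode is to catch Throwable 
(not just Exception) so an error status is set before the finally clause runs. This is 
generic servlet code, not the actual GetImageServlet:

{code}
import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ImageTransferServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    try {
      serveImage(resp);
    } catch (Throwable t) {
      // Catching Throwable covers Errors and RuntimeExceptions as well, so the
      // client sees a 5xx instead of the default 200 OK with an empty body.
      resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
          "Image transfer failed: " + t.getMessage());
    } finally {
      // Cleanup still runs here either way, as in the scenario described above.
    }
  }

  private void serveImage(HttpServletResponse resp) throws IOException {
    // Placeholder for the real image transfer logic.
  }
}
{code}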

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278457#comment-13278457
 ] 

Hadoop QA commented on HDFS-3440:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527964/number1.001.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javac.  The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2464//console

This message is automatically generated.

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3440:
---

Attachment: (was: number1.001.patch)

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)

2012-05-17 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-3415:
-

Attachment: HDFS-3415.patch

Instead of allowing the namenode to move forward with multiple layout versions, the 
storage inspector selector (NNStorage.readAndInspectDirs) should throw an 
exception reporting the inconsistent layout versions.
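
A minimal sketch of the kind of check being proposed (hypothetical helper and file 
handling; the real change would live in NNStorage.readAndInspectDirs, use the 
StorageDirectory APIs, and assume the standard ${dir}/current/VERSION layout):

{code}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

class LayoutVersionChecker {
  /**
   * Read the layoutVersion property from each storage directory's VERSION
   * file and fail fast if they disagree, instead of letting the namenode
   * pick mismatched inspectors later.
   */
  static int checkConsistentLayoutVersion(List<File> storageDirs)
      throws IOException {
    Integer expected = null;
    for (File dir : storageDirs) {
      Properties props = new Properties();
      try (FileInputStream in = new FileInputStream(
          new File(dir, "current/VERSION"))) {
        props.load(in);
      }
      String lvStr = props.getProperty("layoutVersion");
      if (lvStr == null) {
        throw new IOException("No layoutVersion found in " + dir);
      }
      int lv = Integer.parseInt(lvStr.trim());
      if (expected == null) {
        expected = lv;
      } else if (lv != expected) {
        throw new IOException("Inconsistent layout versions: " + dir
            + " has " + lv + " but expected " + expected);
      }
    }
    return expected == null ? 0 : expected;
  }
}
{code}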


> NameNode is getting shutdown by throwing nullpointer exception when one of 
> the layout version is different with others(Multiple storage dirs are 
> configured)
> 
>
> Key: HDFS-3415
> URL: https://issues.apache.org/jira/browse/HDFS-3415
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0, 3.0.0
> Environment: Suse linux + jdk 1.6
>Reporter: Brahma Reddy Battula
>Assignee: Brandon Li
> Attachments: HDFS-3415.patch
>
>
> Scenario:
> =
> Start the namenode and datanode, configuring three storage dirs for the namenode.
> Write 10 files.
> Edit the VERSION file of one of the storage dirs and set the layout version to 123, 
> which is different from the default (-40).
> Stop the namenode.
> Start the namenode.
> Then I get the following exception:
> {noformat}
> 2012-05-13 19:01:41,483 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:261)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:276)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:368)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:402)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:564)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:545)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1093)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1151)
> 2012-05-13 19:01:41,485 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)

2012-05-17 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-3415:
-

Status: Patch Available  (was: Open)

> NameNode is getting shutdown by throwing nullpointer exception when one of 
> the layout version is different with others(Multiple storage dirs are 
> configured)
> 
>
> Key: HDFS-3415
> URL: https://issues.apache.org/jira/browse/HDFS-3415
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0, 3.0.0
> Environment: Suse linux + jdk 1.6
>Reporter: Brahma Reddy Battula
>Assignee: Brandon Li
> Attachments: HDFS-3415.patch
>
>
> Scenario:
> =
> Start the namenode and datanode, configuring three storage dirs for the namenode.
> Write 10 files.
> Edit the VERSION file of one of the storage dirs and set the layout version to 123, 
> which is different from the default (-40).
> Stop the namenode.
> Start the namenode.
> Then I get the following exception:
> {noformat}
> 2012-05-13 19:01:41,483 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:261)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:276)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:368)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:402)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:564)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:545)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1093)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1151)
> 2012-05-13 19:01:41,485 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3440:
---

Attachment: HDFS-3440.001.patch

some unrelated stuff got mixed into the last patch

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3440.001.patch, number1.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3440:
---

Status: Patch Available  (was: Open)

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: number1.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HDFS-3440:
--

 Summary: should more effectively limit stream memory consumption 
when reading corrupt edit logs
 Key: HDFS-3440
 URL: https://issues.apache.org/jira/browse/HDFS-3440
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
Priority: Minor
 Attachments: number1.001.patch

Currently, we do in.mark(100MB) before reading an opcode out of the edit log.  
However, this could result in us using all of those 100 MB when reading bogus 
data, which is not what we want.  It also could easily make some corrupt edit 
log files unreadable.

We should have a stream limiter interface that causes a clean IOException when 
we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3440) should more effectively limit stream memory consumption when reading corrupt edit logs

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3440:
---

Attachment: number1.001.patch

> should more effectively limit stream memory consumption when reading corrupt 
> edit logs
> --
>
> Key: HDFS-3440
> URL: https://issues.apache.org/jira/browse/HDFS-3440
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: number1.001.patch
>
>
> Currently, we do in.mark(100MB) before reading an opcode out of the edit log. 
>  However, this could result in us using all of those 100 MB when reading bogus 
> data, which is not what we want.  It also could easily make some corrupt edit 
> log files unreadable.
> We should have a stream limiter interface that causes a clean IOException 
> when we're in this situation, and does not result in huge memory consumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-744) Support hsync in HDFS

2012-05-17 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278444#comment-13278444
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-744:
-

Thanks a lot, Lars!  The patch looks good.  Some comments:

- Let's name the new CreateFlag as SYNC_BLOCK instead of FORCE.  POSIX uses 
SYNC as you mentioned but POSIX SYNC means syncing every write.

- DFSOutputStream.hsync(),
-* It should call flush(true).  It is better to sync the current block than not 
to sync at all.
-* Need to update the javadoc to say that it only syncs the current block.

- Rename flush(force) to flushOrSync(isSync) in BlockReceiver and 
DFSOutputStream.  Please also update the javadoc.

- We do not use tabs in Hadoop.  Indentation should use two spaces.

- Please add some new tests.  It is not easy to test whether sync actually 
works, but at least add a new test that calls hsync().  See TestFileAppend and 
TestFileAppend[234] to get some ideas (a minimal sketch follows below).
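
A minimal test sketch along those lines (standard JUnit and MiniDFSCluster usage; it 
only exercises the hsync() call path, since, as noted, verifying the actual on-disk 
sync is hard):

{code}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.Test;

public class TestHSync {
  @Test
  public void testHSyncCallPath() throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      Path p = new Path("/testHSync");

      FSDataOutputStream out = fs.create(p);
      out.writeBytes("hello");
      out.hsync();   // exercises the new sync path; data should reach the datanode
      out.close();

      // Sanity check that nothing was lost along the way.
      assertEquals("hello".length(), fs.getFileStatus(p).getLen());
    } finally {
      cluster.shutdown();
    }
  }
}
{code}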

> Support hsync in HDFS
> -
>
> Key: HDFS-744
> URL: https://issues.apache.org/jira/browse/HDFS-744
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: data-node, hdfs client
>Reporter: Hairong Kuang
>Assignee: Lars Hofhansl
> Attachments: HDFS-744-trunk-v2.patch, HDFS-744-trunk-v3.patch, 
> HDFS-744-trunk-v4.patch, HDFS-744-trunk.patch, hdfs-744-v2.txt, 
> hdfs-744-v3.txt, hdfs-744.txt
>
>
> HDFS-731 implements hsync by default as hflush. As described in HADOOP-6313, 
> the real expected semantics should be "flushes out to all replicas and all 
> replicas have done posix fsync equivalent - ie the OS has flushed it to the 
> disk device (but the disk may have it in its cache)." This jira aims to 
> implement the expected behaviour.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3415) NameNode is getting shutdown by throwing nullpointer exception when one of the layout version is different with others(Multiple storage dirs are configured)

2012-05-17 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278409#comment-13278409
 ] 

Brandon Li commented on HDFS-3415:
--

I reproduced this problem with the following configuration:
1. Set two storage directories, say dirA and dirB.
2. Start and then shut down the namenode.
3. Change only dirB's layout version from -40 to 123.
4. Start the namenode; it should fail with the above NullPointerException.

The problem here is:

Two storage inspectors are used in the namenode: 
FSImagePreTransactionalStorageInspector for layout versions before -38, and 
FSImageTransactionalStorageInspector for -38 or anything later.

In this case, the modified storage directory happens to be the last one 
inspected by the namenode when loading the image/edits. Even though the namenode 
sees two layout versions, it saves the last one ("123" in this case) as the 
storage layout version. However, it uses FSImageTransactionalStorageInspector 
to get the image path, because dirA still has -40, and then uses 
FSImagePreTransactionalStorageInspector to get the edit stream. Because 
FSImagePreTransactionalStorageInspector can't recognize the files in a storage 
directory whose real version is newer, some references are not initialized, 
which eventually causes the exception.



> NameNode is getting shutdown by throwing nullpointer exception when one of 
> the layout version is different with others(Multiple storage dirs are 
> configured)
> 
>
> Key: HDFS-3415
> URL: https://issues.apache.org/jira/browse/HDFS-3415
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0, 3.0.0
> Environment: Suse linux + jdk 1.6
>Reporter: Brahma Reddy Battula
>Assignee: Brandon Li
>
> Scenario:
> =
> Start the namenode and datanode, configuring three storage dirs for the namenode.
> Write 10 files.
> Edit the VERSION file of one of the storage dirs and set the layout version to 123, 
> which is different from the default (-40).
> Stop the namenode.
> Start the namenode.
> Then I get the following exception:
> {noformat}
> 2012-05-13 19:01:41,483 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:261)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:276)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:368)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:402)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:564)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:545)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1093)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1151)
> 2012-05-13 19:01:41,485 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2276) src/test/unit tests not being run in mavenized HDFS

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-2276:
--

 Target Version/s: 2.0.1
Affects Version/s: (was: 0.23.0)
   2.0.0

> src/test/unit tests not being run in mavenized HDFS
> ---
>
> Key: HDFS-2276
> URL: https://issues.apache.org/jira/browse/HDFS-2276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-2276.txt
>
>
> There are about 5 tests in src/test/unit that are no longer being run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278397#comment-13278397
 ] 

Hudson commented on HDFS-3391:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2282 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2282/])
HDFS-3391. Fix InvalidateBlocks to compare blocks including their 
generation stamps. Contributed by Todd Lipcon. (Revision 1339897)

 Result = FAILURE
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339897
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/InvalidateBlocks.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightHashSet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightLinkedSet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestLightWeightHashSet.java


> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2956) calling fetchdt without a --renewer argument throws NPE

2012-05-17 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278396#comment-13278396
 ] 

Aaron T. Myers commented on HDFS-2956:
--

Hey Daryn, any update here? I just bumped into this myself.

> calling fetchdt without a --renewer argument throws NPE
> ---
>
> Key: HDFS-2956
> URL: https://issues.apache.org/jira/browse/HDFS-2956
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Affects Versions: 0.24.0
>Reporter: Todd Lipcon
>
> If I call "bin/hdfs fetchdt /tmp/mytoken" without a "--renewer foo" argument, 
> then it will throw a NullPointerException:
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:830)
> this is because getDelegationToken is being called with a null renewer
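
For illustration only, the kind of null guard that would avoid the NPE (hypothetical 
helper and placement; the actual fix might instead default the renewer or validate the 
fetchdt arguments up front):

{code}
import org.apache.hadoop.io.Text;

class RenewerArg {
  /** Turn an optional --renewer argument into a non-null Text value. */
  static Text renewerOrDefault(String renewer) {
    if (renewer == null || renewer.isEmpty()) {
      // Passing null straight through to getDelegationToken() is what
      // triggers the NullPointerException described above.
      return new Text("");
    }
    return new Text(renewer);
  }
}
{code}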

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3439) Balancer exits if fs.defaultFS is set to a different, but semantically identical, URI from dfs.namenode.rpc-address

2012-05-17 Thread Aaron T. Myers (JIRA)
Aaron T. Myers created HDFS-3439:


 Summary: Balancer exits if fs.defaultFS is set to a different, but 
semantically identical, URI from dfs.namenode.rpc-address
 Key: HDFS-3439
 URL: https://issues.apache.org/jira/browse/HDFS-3439
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 2.0.0
Reporter: Aaron T. Myers


The balancer determines the set of NN URIs to balance by looking at 
fs.defaultFS and all possible dfs.namenode.(service)rpc-address settings. If 
fs.defaultFS is, for example, set to "hdfs://foo.example.com:8020/" (note the 
trailing "/") and the rpc-address is set to "hdfs://foo.example.com:8020" 
(without a "/"), then the balancer will conclude that there are two NNs and try 
to balance both. However, since both of these URIs refer to the same actual FS 
instance, the balancer will exit with "java.io.IOException: Another balancer is 
running.  Exiting ..."
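
For illustration, the kind of normalization that would make the two spellings compare 
equal (a sketch only; the helper name is made up and this is not the balancer's actual 
code):

{code}
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class NameNodeUriDedup {
  /** Normalize an NN URI so that "hdfs://host:8020/" and "hdfs://host:8020"
   *  compare equal: keep only the scheme, host, and port. */
  static URI normalize(URI uri) {
    return URI.create(uri.getScheme() + "://" + uri.getHost() + ":" + uri.getPort());
  }

  static Set<URI> dedup(Iterable<URI> candidates) {
    Set<URI> unique = new HashSet<URI>();
    for (URI u : candidates) {
      unique.add(normalize(u));
    }
    return unique;
  }

  public static void main(String[] args) {
    Set<URI> nns = dedup(Arrays.asList(
        URI.create("hdfs://foo.example.com:8020/"),
        URI.create("hdfs://foo.example.com:8020")));
    // Prints 1: both spellings refer to the same NameNode.
    System.out.println(nns.size());
  }
}
{code}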

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278361#comment-13278361
 ] 

Hudson commented on HDFS-3391:
--

Integrated in Hadoop-Common-trunk-Commit #2264 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2264/])
HDFS-3391. Fix InvalidateBlocks to compare blocks including their 
generation stamps. Contributed by Todd Lipcon. (Revision 1339897)

 Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339897
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/InvalidateBlocks.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightHashSet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightLinkedSet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestLightWeightHashSet.java


> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278359#comment-13278359
 ] 

Hudson commented on HDFS-3391:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2337 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2337/])
HDFS-3391. Fix InvalidateBlocks to compare blocks including their 
generation stamps. Contributed by Todd Lipcon. (Revision 1339897)

 Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339897
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/InvalidateBlocks.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightHashSet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/LightWeightLinkedSet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockRecovery.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestLightWeightHashSet.java


> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-3391:
--

   Resolution: Fixed
Fix Version/s: 2.0.1
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278346#comment-13278346
 ] 

Todd Lipcon commented on HDFS-3391:
---

Thanks Nicholas. The two javadoc warnings above are due to gridmix, so 
unrelated to this patch. I'll commit this momentarily.

> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278341#comment-13278341
 ] 

Colin Patrick McCabe commented on HDFS-3049:


bq. Also, I'm not sure why the exception is swallowed instead of rethrown. If 
it fails to open the edit log, shouldn't that generate an exception on init()? 
Should we make init() a public interface (eg "open()") instead, so that the 
caller is cognizant of the flow here, instead of doing it lazily? I think that 
would also simplify the other functions, which could just do a 
Preconditions.checkState(state == State.OPEN) instead of always handling 
lazy-init.

You're right-- the exception should be thrown if resync == false.

As for creating a public init() method-- I guess, but this patch is getting 
kind of big already.  Perhaps we could file a separate JIRA for that?  I also 
have some other API ideas that might improve efficiency (not going to discuss 
them here due to space constraints).

bq. Why do you close() here but not close() in the normal case where you reach 
the end of the log? It seems it should be up to the caller to close upon 
hitting the "eof" (null txn) either way.

The rationale behind this was discussed in HDFS-3335.  Basically, if there is 
corruption at the end of the log, but we read everything we were supposed to, 
we don't want to throw an exception.  As for closing in the eof case, that 
seems unnecessary.  The caller has to call close() anyway; that's part of the 
contract for this API.  So we don't really add any value by doing it 
automatically.
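
To make that contract concrete, here's a minimal caller-side sketch (a 
hypothetical stream interface with simplified signatures, not the real 
EditLogInputStream API):

{code}
import java.io.Closeable;
import java.io.IOException;

// Sketch only: the caller always closes the stream, whether reading ended
// at a clean "eof" (null op) or at trailing corruption.
class CallerContractSketch {
  interface OpStream extends Closeable {
    Object readOp() throws IOException;   // returns null at "eof"
  }

  static void applyAll(OpStream in) throws IOException {
    try {
      Object op;
      while ((op = in.readOp()) != null) {
        // apply the transaction here ...
      }
    } finally {
      in.close();   // caller-side close, regardless of how reading ended
    }
  }
}
{code}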

bq. again, why not just catch Throwable?

Yeah, we should do that.  Will fix.

bq. IllegalStateException would be more appropriate here

ok

[streamlimiter comments]

I agree with most/all of this.  I think this part can probably be separated out.

[log message comments]

Yes, some of those should probably be debug-level messages -- at least the 
ones which just describe "situation normal, added new stream", etc.

[separate into 3 patches]
Well, it's worth a try.  There are some non-obvious dependencies, but I'll see 
if I can split it up.

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about 

[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278334#comment-13278334
 ] 

Hadoop QA commented on HDFS-3391:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527912/hdfs-3391.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 2 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2463//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2463//console

This message is automatically generated.

> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278317#comment-13278317
 ] 

Todd Lipcon commented on HDFS-3049:
---

{code}
+  } catch (RuntimeException e) {
+    LOG.error("caught exception initializing " + this, e);
+    state = State.CLOSED;
+    return null;
+  } catch (IOException e) {
+    LOG.error("caught exception initializing " + this, e);
+    state = State.CLOSED;
+    return null;
+  }
{code}

Why not simply catch Throwable here?

Also, I'm not sure why the exception is swallowed instead of rethrown. If it 
fails to open the edit log, shouldn't that generate an exception on init()? 
Should we make init() a public interface (eg "open()") instead, so that the 
caller is cognizant of the flow here, instead of doing it lazily? I think that 
would also simplify the other functions, which could just do a 
Preconditions.checkState(state == State.OPEN) instead of always handling 
lazy-init.
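
Something like this is what I have in mind (just a sketch with made-up names 
to show the shape, not the actual EditLogFileInputStream code):

{code}
import java.io.IOException;
import com.google.common.base.Preconditions;

// Sketch of the explicit-open idea: open() does the setup and lets any
// IOException propagate; the read methods just assert the state.
class ExplicitOpenSketch {
  private enum State { UNINIT, OPEN, CLOSED }
  private State state = State.UNINIT;

  public void open() throws IOException {
    Preconditions.checkState(state == State.UNINIT);
    // ... open the underlying file, read and check the log header, etc. ...
    state = State.OPEN;
  }

  public Object readOp() throws IOException {
    Preconditions.checkState(state == State.OPEN,
        "stream not open (state=%s)", state);
    // ... read and return the next op ...
    return null;
  }
}
{code}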


{code}
+  LOG.info("skipping the rest of " + this + " since we " +
+  "reached txId " + txId);
+  close();
{code}
Why do you close() here but not close() in the normal case where you reach the 
end of the log? It seems it should be up to the caller to close upon hitting 
the "eof" (null txn) either way.


{code}
     try {
-      return reader.readOp(true);
+      return nextOpImpl(true);
     } catch (IOException e) {
+      LOG.error("nextValidOp: got exception while reading " + this, e);
+      return null;
+    } catch (RuntimeException e) {
+      LOG.error("nextValidOp: got exception while reading " + this, e);
       return null;
     }
{code}
again, why not just catch Throwable?


{code}
+    if (!streams.isEmpty()) {
+      String error = String.format("Cannot start writing at txid %s " +
+          "when there is a stream available for read: %s",
+          segmentTxId, streams.get(0));
+      IOUtils.cleanup(LOG, streams.toArray(new EditLogInputStream[0]));
+      throw new RuntimeException(error);
     }
{code}
IllegalStateException would be more appropriate here



Changes to PositionTrackingInputStream: can you refactor out a function like 
{{checkLimit(int amountToRead);}} here? Lots of duplicate code.
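
Something along these lines (a sketch only; the field names are made up):

{code}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the checkLimit() refactor: every read path funnels through one
// helper instead of duplicating the limit arithmetic.
class PositionTrackingSketch extends FilterInputStream {
  private long curPos = 0;
  private long limitPos = Long.MAX_VALUE;

  PositionTrackingSketch(InputStream in) { super(in); }

  private void checkLimit(long amountToRead) throws IOException {
    if (curPos + amountToRead > limitPos) {
      throw new IOException("read of " + amountToRead +
          " byte(s) would go past the limit at " + limitPos);
    }
  }

  @Override
  public int read() throws IOException {
    checkLimit(1);
    int b = super.read();
    if (b != -1) {
      curPos++;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    checkLimit(len);
    int n = super.read(buf, off, len);
    if (n > 0) {
      curPos += n;
    }
    return n;
  }
}
{code}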

Why is the opcode size changed from 100MB to 1.5MB? Didn't you just change it 
to 100MB recently?

Also, why is this change to the limiting behavior lumped in here? It's hard to 
review when the patch has a lot of distinct changes put together.

StreamLimiter needs an interface annotation, or be made package private.



- There are a lot of new LOG.info messages which look more like they should be 
debug level. I don't think the operator would be able to make sense of all this 
output.




How hard would it be to separate this into three patches?
1) Bug fix which uses the new StreamLimiter to fix the issue you mentioned 
higher up in the comments (and seems distinct from the rest)
2) Change the API to get rid of getInputStream() and fix the O(n^2) behavior
3) Introduce RedundantInputStream to solve the issue described in this JIRA

I think there really are three separate things going on here and the 120KB 
patch is difficult to digest.

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug 

[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278298#comment-13278298
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3391:
--

+1 patch looks good.  Thanks.

> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278297#comment-13278297
 ] 

Colin Patrick McCabe commented on HDFS-3049:


(note: the javadoc warnings relate to gridmix and were not introduced by this 
change)

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HDFS-3373) FileContext HDFS implementation can leak socket caches

2012-05-17 Thread John George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John George reassigned HDFS-3373:
-

Assignee: John George

> FileContext HDFS implementation can leak socket caches
> --
>
> Key: HDFS-3373
> URL: https://issues.apache.org/jira/browse/HDFS-3373
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: John George
>
> As noted by Nicholas in HDFS-3359, FileContext doesn't have a close() method, 
> and thus never calls DFSClient.close(). This means that, until finalizers 
> run, DFSClient will hold on to its SocketCache object and potentially have a 
> lot of outstanding sockets/fds held on to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-3391:
--

Attachment: hdfs-3391.txt

Attached patch addresses Nicholas's comments above. The one I did not address 
was to remove the TODO that references HDFS-2668. Since that TODO is not 
addressed by this JIRA, I think it's better to address it in HDFS-2668 itself.

> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-3391.txt, hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278212#comment-13278212
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3391:
--

The patch looks good.  Some comments:

- All calls of invalidateBlocks.contains(..) have matchGenStamp == true.  How 
about removing matchGenStamp from the parameters?

- Remove the TODO below.
{code}
-if(invalidateBlocks.contains(dn.getStorageID(), block)) {
+if(invalidateBlocks.contains(dn.getStorageID(), block, true)) {
 /*  TODO: following assertion is incorrect, see HDFS-2668
 assert storedBlock.findDatanode(dn) < 0 : "Block " + block
 + " in recentInvalidatesSet should not appear in DN " + dn; */
{code}

- In LightWeightHashSet.getEqualElement(..), since the key will be cast to T, 
change the type to T and remove @SuppressWarnings("unchecked").  Then, we need 
to cast the key to T in contains(..).  Add @Override to contains(..).  Also, 
how about renaming getEqualElement to getElement?
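
A rough sketch of the getElement/contains shape being suggested (greatly 
simplified -- the real LightWeightHashSet is a hash table, not a list):

{code}
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class LightWeightSetSketch<T> extends AbstractCollection<T> {
  private final List<T> elements = new ArrayList<T>();

  /** Return the stored element equal to key, or null if there is none. */
  protected T getElement(final T key) {
    for (T e : elements) {
      if (e.equals(key)) {
        return e;
      }
    }
    return null;
  }

  @Override
  @SuppressWarnings("unchecked")
  public boolean contains(final Object key) {
    // the unchecked cast now lives at the untyped entry point
    return getElement((T) key) != null;
  }

  @Override
  public boolean add(T e) { return elements.add(e); }

  @Override
  public Iterator<T> iterator() { return elements.iterator(); }

  @Override
  public int size() { return elements.size(); }
}
{code}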





> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278200#comment-13278200
 ] 

Hadoop QA commented on HDFS-3049:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527878/HDFS-3049.018.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 10 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2462//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2462//console

This message is automatically generated.

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default

[jira] [Commented] (HDFS-3437) Remove name.node.address servlet attribute

2012-05-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278168#comment-13278168
 ] 

Hadoop QA commented on HDFS-3437:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527871/hdfs-3437.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test 
files.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2461//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2461//console

This message is automatically generated.

> Remove name.node.address servlet attribute
> --
>
> Key: HDFS-3437
> URL: https://issues.apache.org/jira/browse/HDFS-3437
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-3437.txt
>
>
> Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY 
> since we always call DfsServlet#createNameNodeProxy within the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3438) BootstrapStandby should not require a rollEdits on active node

2012-05-17 Thread Todd Lipcon (JIRA)
Todd Lipcon created HDFS-3438:
-

 Summary: BootstrapStandby should not require a rollEdits on active 
node
 Key: HDFS-3438
 URL: https://issues.apache.org/jira/browse/HDFS-3438
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha
Affects Versions: 2.0.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon


Currently, BootstrapStandby uses a rollEdits() call on the active NN in order 
to determine the most recent checkpoint and transaction ID. However, this means 
that you cannot bootstrap a standby when the active NN is in safe mode -- i.e. 
you have to start the whole cluster before you can bootstrap a new standby. 
This makes the workflow to convert an existing cluster to HA more complicated. 
We should allow BootstrapStandby to work even when the NN is in safe mode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2617) Replaced Kerberized SSL for image transfer and fsck with SPNEGO-based solution

2012-05-17 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278138#comment-13278138
 ] 

Aaron T. Myers commented on HDFS-2617:
--

bq. Should this be marked resolved or are we leaving it open for commit to 1.1?

I'd like to put some version of this patch in 1.1, perhaps with a config option 
to continue to use KSSL if one wants, so we don't necessarily break deployments 
that are currently successfully using KSSL.

Perhaps we should resolve this one and open a new JIRA along the lines of 
"Back-port HDFS-2617 to branch-1" ?

> Replaced Kerberized SSL for image transfer and fsck with SPNEGO-based solution
> --
>
> Key: HDFS-2617
> URL: https://issues.apache.org/jira/browse/HDFS-2617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: Jakob Homan
>Assignee: Jakob Homan
> Fix For: 2.0.0
>
> Attachments: HDFS-2617-a.patch, HDFS-2617-b.patch, 
> HDFS-2617-config.patch, HDFS-2617-trunk.patch, HDFS-2617-trunk.patch, 
> HDFS-2617-trunk.patch, HDFS-2617-trunk.patch
>
>
> The current approach to secure and authenticate nn web services is based on 
> Kerberized SSL and was developed when a SPNEGO solution wasn't available. Now 
> that we have one, we can get rid of the non-standard KSSL and use SPNEGO 
> throughout.  This will simplify setup and configuration.  Also, Kerberized 
> SSL is a non-standard approach with its own quirks and dark corners 
> (HDFS-2386).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278124#comment-13278124
 ] 

Hudson commented on HDFS-2800:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2280 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2280/])
HDFS-2800. Fix cancellation of checkpoints in the standby node to be more 
reliable. Contributed by Todd Lipcon. (Revision 1339745)

 Result = ABORTED
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339745
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SaveNamespaceContext.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/Canceler.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestSaveNamespace.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java


> HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
> -
>
> Key: HDFS-2800
> URL: https://issues.apache.org/jira/browse/HDFS-2800
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, test
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: 2.0.1
>
> Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, 
> hdfs-2800.txt
>
>
> TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the 
> following assert on line 212 fail:
> {code}
> assertTrue(StandbyCheckpointer.getCanceledCount() > 0);
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.

2012-05-17 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278105#comment-13278105
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3436:
--

Good catch!  I think the bug is in the following:
{code}
//DataNode.transferReplicaForPipelineRecovery(..)
synchronized(data) {
  if (data.isValidRbw(b)) {
    stage = BlockConstructionStage.TRANSFER_RBW;
  } else if (data.isValidBlock(b)) {
    stage = BlockConstructionStage.TRANSFER_FINALIZED;
  } else {
    final String r = data.getReplicaString(b.getBlockPoolId(),
        b.getBlockId());
    throw new IOException(b + " is neither a RBW nor a Finalized, r=" + r);
  }

  storedGS = data.getStoredBlock(b.getBlockPoolId(),
      b.getBlockId()).getGenerationStamp();
  if (storedGS < b.getGenerationStamp()) {
    throw new IOException(
        storedGS + " = storedGS < b.getGenerationStamp(), b=" + b);
  }
  visible = data.getReplicaVisibleLength(b);
}
{code}
It should first call getStoredBlock(..) and then use the stored block to call 
isValidRbw(..) and isValidBlock(..).  It expects GS to be updated but does not 
handle it correctly.
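
Roughly, the reordering would look like this (a sketch against the snippet 
above; the ExtendedBlock construction is only illustrative):

{code}
synchronized(data) {
  // look up the replica we actually have first, so we see its current GS
  Block stored = data.getStoredBlock(b.getBlockPoolId(), b.getBlockId());
  if (stored == null) {
    throw new IOException(b + " does not exist in the dataset");
  }
  storedGS = stored.getGenerationStamp();
  if (storedGS < b.getGenerationStamp()) {
    throw new IOException(
        storedGS + " = storedGS < b.getGenerationStamp(), b=" + b);
  }

  // check validity against the stored block (with its up-to-date GS),
  // not against the caller-supplied block
  ExtendedBlock storedBlock = new ExtendedBlock(b.getBlockPoolId(), stored);
  if (data.isValidRbw(storedBlock)) {
    stage = BlockConstructionStage.TRANSFER_RBW;
  } else if (data.isValidBlock(storedBlock)) {
    stage = BlockConstructionStage.TRANSFER_FINALIZED;
  } else {
    final String r = data.getReplicaString(b.getBlockPoolId(),
        b.getBlockId());
    throw new IOException(b + " is neither a RBW nor a Finalized, r=" + r);
  }
  visible = data.getReplicaVisibleLength(storedBlock);
}
{code}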

> Append to file is failing when one of the datanode where the block present is 
> down.
> ---
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 2.0.0
>Reporter: Brahma Reddy Battula
>Assignee: Vinay
>
> Scenario:
> =
> 1. Cluster with 4 DataNodes.
> 2. Written file to 3 DNs, DN1->DN2->DN3
> 3. Stopped DN3,
> Now Append to file is failing due to addDatanode2ExistingPipeline is failed.
>  *CLinet Trace* 
> {noformat}
> 2012-04-24 22:06:09,947 INFO  hdfs.DFSClient 
> (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in 
> createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery 
> for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 
> in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient (DFSOutputStream.java:run(549)) 
> - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:hflush(1515)) - Error while syncing
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> {noformat}
>  *DataNode Trace*  
> {noformat}
> 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver

[jira] [Commented] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278079#comment-13278079
 ] 

Colin Patrick McCabe commented on HDFS-3049:


general description of the changes between 15 and 17/18 (sorry for not posting 
this earlier, it was late):

* update some unit tests.  In particular 
TestFileJournalManager#getNumberOfTransactions now takes a parameter that 
specifies whether it stops counting transactions at a gap, or not.  Some 
functions in the test want one behavior or the other.

* fix an off-by-one error in checkForGaps.

* remove some dead code that was causing a findbugs warning

* fix a case where String.format was getting the wrong number of args

* fix validation of files with corrupt headers (basically, force a header read).

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3049) During the normal loading NN startup process, fall back on a different EditLog if we see one that is corrupt

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3049:
---

Attachment: HDFS-3049.018.patch

* fix mockito stuff in TestFailureToReadEdits

> During the normal loading NN startup process, fall back on a different 
> EditLog if we see one that is corrupt
> 
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch, 
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch, 
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch, 
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch, 
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch, 
> HDFS-3049.018.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit 
> logs to it until we believe that we have all the latest changes.  
> Unfortunately, if there is an I/O error while reading any of these files, in 
> most cases, we simply abort the startup process.  We should try harder to 
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a 
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to 
> pre-reserve blocks, a disk full condition may cause a truncated log in one 
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories 
> receiving corrupt data, but not all.  This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we 
> believe is the latest transaction ID.
> Proposed changes to recovery mode NN startup:
> we should list out all the available storage directories and allow the 
> operator to select which one he wants to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__curent__XYZ  size:213421345   md5:2345345
> image  size:213421345   md5:2345345
> 2. /foo/baz
> edits__curent__XYZ  size:213421345   md5:2345345345
> image  size:213421345   md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling.  In 
> this case, this means that we should NOT fail if we can't find EditLogs that 
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the 
> NameNode name directory or directories hangs because of an I/O error.  This 
> may happen, for example, when trying to load an image from a hard-mounted NFS 
> directory, when the NFS server has gone away.  Just as now, the operator will 
> have to notice this problem and take steps to correct it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278070#comment-13278070
 ] 

Hudson commented on HDFS-3434:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2279 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2279/])
HDFS-3434. InvalidProtocolBufferException when visiting DN 
browseDirectory.jsp. Contributed by Eli Collins (Revision 1339712)

 Result = ABORTED
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339712
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeHttpServer.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/common/TestJspHelper.java


> InvalidProtocolBufferException when visiting DN browseDirectory.jsp
> ---
>
> Key: HDFS-3434
> URL: https://issues.apache.org/jira/browse/HDFS-3434
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt
>
>
> The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When 
> selecting the "haus04" under the "Node" table I get a link with the http 
> address which is bogus (the wildcard/http port not the nn rpc addr), which 
> results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to 
> 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: 
> Connection refused". The browse this file system link works.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2936) File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication

2012-05-17 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278067#comment-13278067
 ] 

Harsh J commented on HDFS-2936:
---

Nicholas,

bq. Thanks for updating the description. Are you suggesting to use 
dfs.namenode.replication.min for client-side check and use 
dfs.namenode.replication.min.for.write for server-side check?

Sort of. The former (or whatever replaces it) should only check the file's 
replication factor value, which is applied/changed during 
create/setReplicationFactor alone -- not the live block count. This is still a 
server-side check; client-side checks would be of no good to an admin.

bq. BTW, "File close()-ing hangs indefinitely if the number of live blocks does 
not match the minimum replication" is the original design of 
dfs.namenode.replication.min. I think we should not change it.

True, that was the intention. A non-behavior-changing patch can also be 
made (wherein the default of the for.write property will be whatever the 
original min property is). But let's at least provide a way for admins to 
enforce minimum replication _factors_ on files, without having to worry about 
pipelines and whatnot -- if an admin so wishes.

Setting {{dfs.replication}} to final does not work, because there are create() 
API calls and setrep() calls that bypass/disregard that config. Essentially 
that's what led us down this path -- to use the minimum, but just at the 
metadata level, not the live-block level (as it is today).
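
To make the admin-facing difference concrete, a purely illustrative 
configuration sketch (the .for.write key is only a proposal in this JIRA):

{code}
import org.apache.hadoop.conf.Configuration;

public class MinReplicationConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // metadata-level floor, checked at create()/setrep() time
    conf.setInt("dfs.namenode.replication.min", 2);
    // pipeline/close-time check stays relaxed (proposed new key)
    conf.setInt("dfs.namenode.replication.min.for.write", 1);
  }
}
{code}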

> File close()-ing hangs indefinitely if the number of live blocks does not 
> match the minimum replication
> ---
>
> Key: HDFS-2936
> URL: https://issues.apache.org/jira/browse/HDFS-2936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HDFS-2936.patch
>
>
> If an admin wishes to enforce replication today for all the users of their 
> cluster, he may set {{dfs.namenode.replication.min}}. This property prevents 
> users from creating files with < expected replication factor.
> However, the value of minimum replication set by the above value is also 
> checked at several other points, especially during completeFile (close) 
> operations. If a condition arises wherein a write's pipeline may have gotten 
> only < minimum nodes in it, the completeFile operation does not successfully 
> close the file and the client begins to hang waiting for NN to replicate the 
> last bad block in the background. This form of hard-guarantee can, for 
> example, bring down clusters of HBase during high xceiver load on DN, or disk 
> fill-ups on many of them, etc..
> I propose we should split the property in two parts:
> * dfs.namenode.replication.min
> ** Stays the same name, but only checks file creation time replication factor 
> value and during adjustments made via setrep/etc.
> * dfs.namenode.replication.min.for.write
> ** New property that disconnects the rest of the checks from the above 
> property, such as the checks done during block commit, file complete/close, 
> safemode checks for block availability, etc..
> Alternatively, we may also choose to remove the client-side hang of 
> completeFile/close calls with a set number of retries. This would further 
> require discussion about how a file-closure handle ought to be handled.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3431) Improve fuse-dfs truncate

2012-05-17 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3431:
---

Description: 
Fuse-dfs truncate only works for size == 0 and, per the function's comment, is a 
"Weak implementation in that we just delete the file and then re-create it, but 
don't set the user, group, and times to the old file's metadata". Per HDFS-860 
we should return ENOTSUP when the size is neither 0 nor the current size of the file.

Also, we should implement the ftruncate system call in FUSE.

  was:Fuse-dfs truncate only works for size == 0 and per the function's comment 
is a "Weak implementation in that we just delete the file and then re-create 
it, but don't set the user, group, and times to the old file's metadata". Per 
HDFS-860 we should ENOTSUP when the size != 0 or the size of the file.  


> Improve fuse-dfs truncate
> -
>
> Key: HDFS-3431
> URL: https://issues.apache.org/jira/browse/HDFS-3431
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: contrib/fuse-dfs
>Reporter: Eli Collins
>Priority: Minor
>
> Fuse-dfs truncate only works for size == 0 and per the function's comment is 
> a "Weak implementation in that we just delete the file and then re-create it, 
> but don't set the user, group, and times to the old file's metadata". Per 
> HDFS-860 we should ENOTSUP when the size != 0 or the size of the file.  
> Also, we should implement the ftruncate system call in FUSE.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3391) TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing

2012-05-17 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278035#comment-13278035
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3391:
--

Hi Todd, thanks for posting a patch.  I will review it later today.  

> TestPipelinesFailover#testLeaseRecoveryAfterFailover is failing
> ---
>
> Key: HDFS-3391
> URL: https://issues.apache.org/jira/browse/HDFS-3391
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Arun C Murthy
>Assignee: Todd Lipcon
>Priority: Critical
> Attachments: hdfs-3391.txt, hdfs-3391.txt
>
>
> Running org.apache.hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.208 sec <<< 
> FAILURE!
> --
> Running org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
> Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 81.195 sec 
> <<< FAILURE!
> --

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278031#comment-13278031
 ] 

Hudson commented on HDFS-2800:
--

Integrated in Hadoop-Common-trunk-Commit #2263 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2263/])
HDFS-2800. Fix cancellation of checkpoints in the standby node to be more 
reliable. Contributed by Todd Lipcon. (Revision 1339745)

 Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339745
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SaveNamespaceContext.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/Canceler.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestSaveNamespace.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java


> HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
> -
>
> Key: HDFS-2800
> URL: https://issues.apache.org/jira/browse/HDFS-2800
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, test
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: 2.0.1
>
> Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, 
> hdfs-2800.txt
>
>
> TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the 
> following assert on line 212 fail:
> {code}
> assertTrue(StandbyCheckpointer.getCanceledCount() > 0);
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2936) File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication

2012-05-17 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278032#comment-13278032
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-2936:
--

Hi Harsh,

Thanks for updating the description.  Are you suggesting to use 
dfs.namenode.replication.min for client-side check and use 
dfs.namenode.replication.min.for.write for server-side check?

BTW, "File close()-ing hangs indefinitely if the number of live blocks does not 
match the minimum replication" is the original design of 
dfs.namenode.replication.min.  I think we should not change it.

> File close()-ing hangs indefinitely if the number of live blocks does not 
> match the minimum replication
> ---
>
> Key: HDFS-2936
> URL: https://issues.apache.org/jira/browse/HDFS-2936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HDFS-2936.patch
>
>
> If an admin wishes to enforce replication today for all the users of their 
> cluster, he may set {{dfs.namenode.replication.min}}. This property prevents 
> users from creating files with < expected replication factor.
> However, the value of minimum replication set by the above value is also 
> checked at several other points, especially during completeFile (close) 
> operations. If a condition arises wherein a write's pipeline may have gotten 
> only < minimum nodes in it, the completeFile operation does not successfully 
> close the file and the client begins to hang waiting for NN to replicate the 
> last bad block in the background. This form of hard-guarantee can, for 
> example, bring down clusters of HBase during high xceiver load on DN, or disk 
> fill-ups on many of them, etc..
> I propose we should split the property in two parts:
> * dfs.namenode.replication.min
> ** Stays the same name, but only checks file creation time replication factor 
> value and during adjustments made via setrep/etc.
> * dfs.namenode.replication.min.for.write
> ** New property that disconnects the rest of the checks from the above 
> property, such as the checks done during block commit, file complete/close, 
> safemode checks for block availability, etc..
> Alternatively, we may also choose to remove the client-side hang of 
> completeFile/close calls with a set number of retries. This would further 
> require discussion about how a file-closure handle ought to be handled.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3437) Remove name.node.address servlet attribute

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-3437:
--

Status: Patch Available  (was: Open)

> Remove name.node.address servlet attribute
> --
>
> Key: HDFS-3437
> URL: https://issues.apache.org/jira/browse/HDFS-3437
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-3437.txt
>
>
> Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY 
> since we always call DfsServlet#createNameNodeProxy within the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-3437) Remove name.node.address servlet attribute

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-3437:
--

Attachment: hdfs-3437.txt

Patch attached. 

I'm not sure about the last case in testGetUgi, e.g. whether NAMENODE_ADDRESS should 
take precedence over DELEGATION_PARAMETER, since printGotoForm, for example, sets 
both, and previously NAMENODE_ADDRESS might have been ignored in favor of 
NAMENODE_ADDRESS_ATTRIBUTE_KEY. I don't think this changes behavior, and the new 
test passes with the current code, but it's worth double-checking.

> Remove name.node.address servlet attribute
> --
>
> Key: HDFS-3437
> URL: https://issues.apache.org/jira/browse/HDFS-3437
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-3437.txt
>
>
> Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY 
> since we always call DfsServlet#createNameNodeProxy within the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278021#comment-13278021
 ] 

Hudson commented on HDFS-2800:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2336 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2336/])
HDFS-2800. Fix cancellation of checkpoints in the standby node to be more 
reliable. Contributed by Todd Lipcon. (Revision 1339745)

 Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339745
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SaveNamespaceContext.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/Canceler.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestSaveNamespace.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java


> HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
> -
>
> Key: HDFS-2800
> URL: https://issues.apache.org/jira/browse/HDFS-2800
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, test
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: 2.0.1
>
> Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, 
> hdfs-2800.txt
>
>
> TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the 
> following assert on line 212 fail:
> {code}
> assertTrue(StandbyCheckpointer.getCanceledCount() > 0);
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy

2012-05-17 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2800:
--

   Resolution: Fixed
Fix Version/s: 2.0.1
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

> HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
> -
>
> Key: HDFS-2800
> URL: https://issues.apache.org/jira/browse/HDFS-2800
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, test
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: 2.0.1
>
> Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, 
> hdfs-2800.txt
>
>
> TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the 
> following assert on line 212 fail:
> {code}
> assertTrue(StandbyCheckpointer.getCanceledCount() > 0);
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278013#comment-13278013
 ] 

Hudson commented on HDFS-1153:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2278 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2278/])
HDFS-1153. dfsnodelist.jsp should handle invalid input parameters. 
Contributed by Ravi Phulari (Revision 1339706)

 Result = ABORTED
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339706
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java


> dfsnodelist.jsp should handle invalid input parameters
> --
>
> Key: HDFS-1153
> URL: https://issues.apache.org/jira/browse/HDFS-1153
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 1.0.0
>Reporter: Ravi Phulari
>Assignee: Ravi Phulari
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: HDFS-1153.patch, hdfs-1153.txt
>
>
> Navigation to dfsnodelist.jsp  with invalid input parameters produces NPE and 
> HTTP 500 error. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp

2012-05-17 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277995#comment-13277995
 ] 

Eli Collins commented on HDFS-3434:
---

Thanks Todd. I've committed this to trunk and merged to branch-2. Filed 
HDFS-3437 for removing name.node.address.

> InvalidProtocolBufferException when visiting DN browseDirectory.jsp
> ---
>
> Key: HDFS-3434
> URL: https://issues.apache.org/jira/browse/HDFS-3434
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt
>
>
> The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When 
> selecting the "haus04" under the "Node" table I get a link with the http 
> address which is bogus (the wildcard/http port not the nn rpc addr), which 
> results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to 
> 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: 
> Connection refused". The browse this file system link works.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277988#comment-13277988
 ] 

Hudson commented on HDFS-3434:
--

Integrated in Hadoop-Common-trunk-Commit #2262 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2262/])
HDFS-3434. InvalidProtocolBufferException when visiting DN 
browseDirectory.jsp. Contributed by Eli Collins (Revision 1339712)

 Result = SUCCESS
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339712
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeHttpServer.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/common/TestJspHelper.java


> InvalidProtocolBufferException when visiting DN browseDirectory.jsp
> ---
>
> Key: HDFS-3434
> URL: https://issues.apache.org/jira/browse/HDFS-3434
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt
>
>
> The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When 
> selecting the "haus04" under the "Node" table I get a link with the http 
> address which is bogus (the wildcard/http port not the nn rpc addr), which 
> results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to 
> 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: 
> Connection refused". The browse this file system link works.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277980#comment-13277980
 ] 

Hudson commented on HDFS-1153:
--

Integrated in Hadoop-Common-trunk-Commit #2261 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2261/])
HDFS-1153. dfsnodelist.jsp should handle invalid input parameters. 
Contributed by Ravi Phulari (Revision 1339706)

 Result = SUCCESS
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339706
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java


> dfsnodelist.jsp should handle invalid input parameters
> --
>
> Key: HDFS-1153
> URL: https://issues.apache.org/jira/browse/HDFS-1153
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 1.0.0
>Reporter: Ravi Phulari
>Assignee: Ravi Phulari
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: HDFS-1153.patch, hdfs-1153.txt
>
>
> Navigation to dfsnodelist.jsp  with invalid input parameters produces NPE and 
> HTTP 500 error. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277977#comment-13277977
 ] 

Hudson commented on HDFS-1153:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2335 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2335/])
HDFS-1153. dfsnodelist.jsp should handle invalid input parameters. 
Contributed by Ravi Phulari (Revision 1339706)

 Result = SUCCESS
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339706
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NamenodeJspHelper.java


> dfsnodelist.jsp should handle invalid input parameters
> --
>
> Key: HDFS-1153
> URL: https://issues.apache.org/jira/browse/HDFS-1153
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 1.0.0
>Reporter: Ravi Phulari
>Assignee: Ravi Phulari
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: HDFS-1153.patch, hdfs-1153.txt
>
>
> Navigation to dfsnodelist.jsp  with invalid input parameters produces NPE and 
> HTTP 500 error. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3434) InvalidProtocolBufferException when visiting DN browseDirectory.jsp

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277978#comment-13277978
 ] 

Hudson commented on HDFS-3434:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2335 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2335/])
HDFS-3434. InvalidProtocolBufferException when visiting DN 
browseDirectory.jsp. Contributed by Eli Collins (Revision 1339712)

 Result = SUCCESS
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339712
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeHttpServer.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/common/TestJspHelper.java


> InvalidProtocolBufferException when visiting DN browseDirectory.jsp
> ---
>
> Key: HDFS-3434
> URL: https://issues.apache.org/jira/browse/HDFS-3434
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Eli Collins
>Assignee: Eli Collins
> Attachments: hdfs-3434.txt, hdfs-3434.txt, hdfs-3434.txt
>
>
> The nnaddr field on dfsnodelist.jsp is getting set incorrectly. When 
> selecting the "haus04" under the "Node" table I get a link with the http 
> address which is bogus (the wildcard/http port not the nn rpc addr), which 
> results in an error of "Call From haus04.mtv.cloudera.com/172.29.122.94 to 
> 0.0.0.0:10070 failed on connection exception: java.net.ConnectException: 
> Connection refused". The browse this file system link works.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-1153:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> dfsnodelist.jsp should handle invalid input parameters
> --
>
> Key: HDFS-1153
> URL: https://issues.apache.org/jira/browse/HDFS-1153
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 1.0.0
>Reporter: Ravi Phulari
>Assignee: Ravi Phulari
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: HDFS-1153.patch, hdfs-1153.txt
>
>
> Navigation to dfsnodelist.jsp  with invalid input parameters produces NPE and 
> HTTP 500 error. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3437) Remove name.node.address servlet attribute

2012-05-17 Thread Eli Collins (JIRA)
Eli Collins created HDFS-3437:
-

 Summary: Remove name.node.address servlet attribute
 Key: HDFS-3437
 URL: https://issues.apache.org/jira/browse/HDFS-3437
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: name-node
Affects Versions: 2.0.0
Reporter: Eli Collins
Assignee: Eli Collins


Per HDFS-3434 we should be able to get rid of NAMENODE_ADDRESS_ATTRIBUTE_KEY 
since we always call DfsServlet#createNameNodeProxy within the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-1153:
--

  Component/s: data-node
 Priority: Minor  (was: Major)
 Target Version/s:   (was: 2.0.1)
Affects Version/s: (was: 0.20.2)
   (was: 0.20.1)
   1.0.0
Fix Version/s: 2.0.1

I've committed this and merged to branch-2. Thanks Ravi!

> dfsnodelist.jsp should handle invalid input parameters
> --
>
> Key: HDFS-1153
> URL: https://issues.apache.org/jira/browse/HDFS-1153
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 1.0.0
>Reporter: Ravi Phulari
>Assignee: Ravi Phulari
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: HDFS-1153.patch, hdfs-1153.txt
>
>
> Navigation to dfsnodelist.jsp  with invalid input parameters produces NPE and 
> HTTP 500 error. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-1153) dfsnodelist.jsp should handle invalid input parameters

2012-05-17 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated HDFS-1153:
--

Summary: dfsnodelist.jsp should handle invalid input parameters  (was: The 
navigation to /dfsnodelist.jsp  with invalid input parameters produces NPE and 
HTTP 500 error)

> dfsnodelist.jsp should handle invalid input parameters
> --
>
> Key: HDFS-1153
> URL: https://issues.apache.org/jira/browse/HDFS-1153
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.20.1, 0.20.2
>Reporter: Ravi Phulari
>Assignee: Ravi Phulari
> Attachments: HDFS-1153.patch, hdfs-1153.txt
>
>
> Navigation to dfsnodelist.jsp  with invalid input parameters produces NPE and 
> HTTP 500 error. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2800) HA: TestStandbyCheckpoints.testCheckpointCancellation is racy

2012-05-17 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277960#comment-13277960
 ] 

Eli Collins commented on HDFS-2800:
---

+1 updated patch looks good

> HA: TestStandbyCheckpoints.testCheckpointCancellation is racy
> -
>
> Key: HDFS-2800
> URL: https://issues.apache.org/jira/browse/HDFS-2800
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, test
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Attachments: hdfs-2800.txt, hdfs-2800.txt, hdfs-2800.txt, 
> hdfs-2800.txt
>
>
> TestStandbyCheckpoints.testCheckpointCancellation is racy, have seen the 
> following assert on line 212 fail:
> {code}
> assertTrue(StandbyCheckpointer.getCanceledCount() > 0);
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2717) BookKeeper Journal output stream doesn't check addComplete rc

2012-05-17 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277877#comment-13277877
 ] 

Uma Maheswara Rao G commented on HDFS-2717:
---

Patch looks great, Ivan.
@Rakesh or Jitendra, do you have any more comments?
I am +1 on this patch.

> BookKeeper Journal output stream doesn't check addComplete rc
> -
>
> Key: HDFS-2717
> URL: https://issues.apache.org/jira/browse/HDFS-2717
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Ivan Kelly
>Assignee: Ivan Kelly
> Attachments: HDFS-2717.2.diff, HDFS-2717.diff
>
>
> As the summary says, we're not checking the addComplete return code, so there's a 
> chance of losing an update. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277830#comment-13277830
 ] 

Hudson commented on HDFS-3433:
--

Integrated in Hadoop-Mapreduce-trunk #1082 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/])
HDFS-3433. GetImageServlet should allow administrative requestors when 
security is enabled. Contributed by Aaron T. Myers. (Revision 1339540)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540
Files : 
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java


> GetImageServlet should allow administrative requestors when security is 
> enabled
> ---
>
> Key: HDFS-3433
> URL: https://issues.apache.org/jira/browse/HDFS-3433
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
> Fix For: 2.0.1
>
> Attachments: HDFS-3433.patch
>
>
> Currently the GetImageServlet only allows the NN and checkpointing nodes to 
> connect. Since we now have the fetchImage command in DFSAdmin, we should also 
> allow administrative requests as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-860) fuse-dfs truncate behavior causes issues with scp

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277829#comment-13277829
 ] 

Hudson commented on HDFS-860:
-

Integrated in Hadoop-Mapreduce-trunk #1082 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/])
HDFS-860. fuse-dfs truncate behavior causes issues with scp. Contributed by 
Brian Bockelman (Revision 1339413)

 Result = SUCCESS
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339413
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src/fuse_impls_truncate.c


> fuse-dfs truncate behavior causes issues with scp
> -
>
> Key: HDFS-860
> URL: https://issues.apache.org/jira/browse/HDFS-860
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: contrib/fuse-dfs
>Reporter: Brian Bockelman
>Assignee: Brian Bockelman
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: HDFS-860.patch, hdfs-860.txt
>
>
> For whatever reason, scp issues a "truncate" once it's written a file to 
> truncate the file to the # of bytes it has written (i.e., if a file is X 
> bytes, it calls truncate(X)).
> This fails on the current fuse-dfs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3413) TestFailureToReadEdits timing out

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277824#comment-13277824
 ] 

Hudson commented on HDFS-3413:
--

Integrated in Hadoop-Mapreduce-trunk #1082 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/])
HDFS-3413. TestFailureToReadEdits timing out. Contributed by Aaron T. 
Myers. (Revision 1339250)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339250
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestFailureToReadEdits.java


> TestFailureToReadEdits timing out
> -
>
> Key: HDFS-3413
> URL: https://issues.apache.org/jira/browse/HDFS-3413
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, test
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>Priority: Critical
> Fix For: 2.0.1
>
> Attachments: HDFS-3413.patch
>
>
> HDFS-3026 made it so that an HA NN that does not fully complete an HA state 
> transition will exit immediately. TestFailureToReadEdits has a test case 
> which causes an incomplete state transition, thus causing a JVM exit in the 
> test.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3422) TestStandbyIsHot timeouts too aggressive

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277825#comment-13277825
 ] 

Hudson commented on HDFS-3422:
--

Integrated in Hadoop-Mapreduce-trunk #1082 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1082/])
HDFS-3422. TestStandbyIsHot timeouts too aggressive. Contributed by Todd 
Lipcon. (Revision 1339452)

 Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339452
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyIsHot.java


> TestStandbyIsHot timeouts too aggressive
> 
>
> Key: HDFS-3422
> URL: https://issues.apache.org/jira/browse/HDFS-3422
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: hdfs-3422.txt
>
>
> I've seen TestStandbyIsHot timeout a few times waiting for replication to 
> change, but when I look at the logs, it appears everything is fine. The block 
> deletions are just a bit slow in being processed.
> To improve the test we should either increase the timeouts, or explicitly 
> trigger heartbeats and replication work after changing the desired 
> replication levels.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3422) TestStandbyIsHot timeouts too aggressive

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277738#comment-13277738
 ] 

Hudson commented on HDFS-3422:
--

Integrated in Hadoop-Hdfs-trunk #1048 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1048/])
HDFS-3422. TestStandbyIsHot timeouts too aggressive. Contributed by Todd 
Lipcon. (Revision 1339452)

 Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339452
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyIsHot.java


> TestStandbyIsHot timeouts too aggressive
> 
>
> Key: HDFS-3422
> URL: https://issues.apache.org/jira/browse/HDFS-3422
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.0.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 2.0.1
>
> Attachments: hdfs-3422.txt
>
>
> I've seen TestStandbyIsHot timeout a few times waiting for replication to 
> change, but when I look at the logs, it appears everything is fine. The block 
> deletions are just a bit slow in being processed.
> To improve the test we should either increase the timeouts, or explicitly 
> trigger heartbeats and replication work after changing the desired 
> replication levels.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277741#comment-13277741
 ] 

Hudson commented on HDFS-3433:
--

Integrated in Hadoop-Hdfs-trunk #1048 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1048/])
HDFS-3433. GetImageServlet should allow administrative requestors when 
security is enabled. Contributed by Aaron T. Myers. (Revision 1339540)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540
Files : 
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java


> GetImageServlet should allow administrative requestors when security is 
> enabled
> ---
>
> Key: HDFS-3433
> URL: https://issues.apache.org/jira/browse/HDFS-3433
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
> Fix For: 2.0.1
>
> Attachments: HDFS-3433.patch
>
>
> Currently the GetImageServlet only allows the NN and checkpointing nodes to 
> connect. Since we now have the fetchImage command in DFSAdmin, we should also 
> allow administrative requests as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-860) fuse-dfs truncate behavior causes issues with scp

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277740#comment-13277740
 ] 

Hudson commented on HDFS-860:
-

Integrated in Hadoop-Hdfs-trunk #1048 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1048/])
HDFS-860. fuse-dfs truncate behavior causes issues with scp. Contributed by 
Brian Bockelman (Revision 1339413)

 Result = SUCCESS
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339413
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src/fuse_impls_truncate.c


> fuse-dfs truncate behavior causes issues with scp
> -
>
> Key: HDFS-860
> URL: https://issues.apache.org/jira/browse/HDFS-860
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: contrib/fuse-dfs
>Reporter: Brian Bockelman
>Assignee: Brian Bockelman
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: HDFS-860.patch, hdfs-860.txt
>
>
> For whatever reason, scp issues a "truncate" once it's written a file to 
> truncate the file to the # of bytes it has written (i.e., if a file is X 
> bytes, it calls truncate(X)).
> This fails on the current fuse-dfs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277730#comment-13277730
 ] 

Hudson commented on HDFS-3433:
--

Integrated in Hadoop-Mapreduce-trunk-Commit #2275 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2275/])
HDFS-3433. GetImageServlet should allow administrative requestors when 
security is enabled. Contributed by Aaron T. Myers. (Revision 1339540)

 Result = ABORTED
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540
Files : 
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java


> GetImageServlet should allow administrative requestors when security is 
> enabled
> ---
>
> Key: HDFS-3433
> URL: https://issues.apache.org/jira/browse/HDFS-3433
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
> Fix For: 2.0.1
>
> Attachments: HDFS-3433.patch
>
>
> Currently the GetImageServlet only allows the NN and checkpointing nodes to 
> connect. Since we now have the fetchImage command in DFSAdmin, we should also 
> allow administrative requests as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3423) BookKeeperJournalManager: NN startup is failing, when tries to recoverUnfinalizedSegments() a bad inProgress_ ZNodes

2012-05-17 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277723#comment-13277723
 ] 

Rakesh R commented on HDFS-3423:


In the patch I see that only maxTxId.reset(maxTxId.get()-1); is invoked on 
SegmentEmptyException. But I'm thinking about the inprogress_x ledgers which 
are not empty and had previously been finalized but not deleted.

The following code is taken from BKJM. When l.verify(zkc, finalisedPath) == 
true, instead of storing the maxTxId and deleting the znode, we would only 
delete the inprogress_x node, since the corresponding edit_x_y log file already 
exists. IMHO, this is safer. What's your opinion?

{noformat}
  try {
l.write(zkc, finalisedPath);
  } catch (KeeperException.NodeExistsException nee) {
if (!l.verify(zkc, finalisedPath)) {
  throw new IOException("Node " + finalisedPath + " already exists"
+ " but data doesn't match");
}
  }
  maxTxId.store(lastTxId);
  zkc.delete(inprogressPath, inprogressStat.getVersion());
{noformat}
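
Roughly, what I have in mind is the following variation of the snippet above 
(just a sketch reusing the same names; error handling and the 
SegmentEmptyException path are omitted):

{noformat}
  try {
    l.write(zkc, finalisedPath);
  } catch (KeeperException.NodeExistsException nee) {
    if (!l.verify(zkc, finalisedPath)) {
      throw new IOException("Node " + finalisedPath + " already exists"
        + " but data doesn't match");
    }
    // The finalized znode already exists with matching data, i.e. the segment
    // was finalized earlier but the inprogress_x znode was left behind.
    // Skip maxTxId.store() and only clean up the stale inprogress node.
    zkc.delete(inprogressPath, inprogressStat.getVersion());
    return;
  }
  maxTxId.store(lastTxId);
  zkc.delete(inprogressPath, inprogressStat.getVersion());
{noformat}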

> BookKeeperJournalManager: NN startup is failing, when tries to 
> recoverUnfinalizedSegments() a bad inProgress_ ZNodes
> 
>
> Key: HDFS-3423
> URL: https://issues.apache.org/jira/browse/HDFS-3423
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Rakesh R
>Assignee: Ivan Kelly
> Attachments: HDFS-3423.diff
>
>
> Say, the InProgress_000X node is corrupted due to not writing the 
> data(version, ledgerId, firstTxId) to this inProgress_000X znode. Namenode 
> startup has the logic to recover all the unfinalized segments, here will try 
> to read the segment and getting shutdown.
> {noformat}
> EditLogLedgerMetadata.java:
> static EditLogLedgerMetadata read(ZooKeeper zkc, String path)
>   throws IOException, KeeperException.NoNodeException  {
>   byte[] data = zkc.getData(path, false, null);
>   String[] parts = new String(data).split(";");
>   if (parts.length == 3)
>  reading inprogress metadata
>   else if (parts.length == 4)
>  reading inprogress metadata
>   else
> throw new IOException("Invalid ledger entry, "
>   + new String(data));
>   }
> {noformat}
> Scenario:- Leaving bad inProgress_000X node ?
> Assume BKJM has created the inProgress_000X zNode and ZK is not available 
> when trying to add the metadata. Now, inProgress_000X ends up with partial 
> information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.

2012-05-17 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277724#comment-13277724
 ] 

Vinay commented on HDFS-3436:
-

Scenario is as follows:
-
1. The cluster has 4 DNs.
2. The file is written to 3 DNs, DN1->DN2->DN3, with a genstamp of 1001.
3. Now DN3 is stopped.
4. Now append is called.
5. For this append the client will try to create the pipeline DN1->DN2->DN3.
During this time the following things happen:
1. The generation stamp is updated in the volumeMap to 1002.
2. The datanode then tries to connect to the next DN in the pipeline.
If the next DN in the pipeline is down, an exception is thrown and the client 
tries to re-form the pipeline.

Since DN3 is down, the genstamp on DN1 and DN2 has already been updated to 1002, 
but the client does not know about this.
6. The client now tries to add one more datanode to the append pipeline, i.e. DN4, 
and asks DN1 or DN2 to transfer the block to DN4. But the client asks for the 
transfer of the block with genstamp 1001.
7. Since DN1 and DN2 do not have a block with genstamp 1001, the transfer fails 
and the client's write fails as well.

Proposed solution
--
In DataXceiver#writeBlock(), if we try to create the mirror connection before 
creating the BlockReceiver instance, this solves the problem.
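
To make steps 5-7 concrete, here is a tiny, self-contained toy model (no HDFS 
classes, all names invented) of why the transfer with the stale genstamp gets 
rejected, and hence why connecting to the mirror before bumping any local state 
avoids the failure:

{code}
// Toy model of the genstamp mismatch described above -- purely illustrative,
// no HDFS classes are involved.
import java.util.HashMap;
import java.util.Map;

public class GenstampMismatchSketch {
  public static void main(String[] args) {
    // Replica map on DN1/DN2: blockId -> genstamp currently recorded in the volumeMap.
    Map<Long, Long> volumeMap = new HashMap<Long, Long>();
    long blockId = 1L;
    volumeMap.put(blockId, 1001L);   // block written with genstamp 1001 (step 2)

    // Step 5: append pipeline setup bumps the genstamp locally *before* the
    // connection to the stopped downstream DN fails.
    volumeMap.put(blockId, 1002L);

    // Step 6: the client still believes the genstamp is 1001 and asks DN1/DN2
    // to transfer that replica to DN4.
    long genstampRequestedByClient = 1001L;

    // Step 7: the transfer is refused because no replica with genstamp 1001 exists.
    if (volumeMap.get(blockId).longValue() != genstampRequestedByClient) {
      System.out.println("transfer fails: replica has genstamp " + volumeMap.get(blockId)
          + " but the client asked for " + genstampRequestedByClient);
    }
  }
}
{code}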

> Append to file is failing when one of the datanode where the block present is 
> down.
> ---
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 2.0.0
>Reporter: Brahma Reddy Battula
>Assignee: Vinay
>
> Scenario:
> =
> 1. Cluster with 4 DataNodes.
> 2. Written file to 3 DNs, DN1->DN2->DN3
> 3. Stopped DN3,
> Now Append to file is failing due to addDatanode2ExistingPipeline is failed.
>  *CLinet Trace* 
> {noformat}
> 2012-04-24 22:06:09,947 INFO  hdfs.DFSClient 
> (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in 
> createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery 
> for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 
> in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient (DFSOutputStream.java:run(549)) 
> - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:hflush(1515)) - Error while syncing
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> {noformat}
>  *DataNode Trace*  
> {no

[jira] [Updated] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.

2012-05-17 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-3436:


Description: 
Scenario:
=

1. Cluster with 4 DataNodes.
2. Written file to 3 DNs, DN1->DN2->DN3
3. Stopped DN3,
Now append to the file is failing because addDatanode2ExistingPipeline failed.

 *Client Trace* 
{noformat}
2012-04-24 22:06:09,947 INFO  hdfs.DFSClient 
(DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in 
createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as ***:50010
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
2012-04-24 22:06:09,947 WARN  hdfs.DFSClient 
(DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery 
for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 in 
pipeline *:50010, **:50010, *:50010: bad datanode **:50010
2012-04-24 22:06:10,072 WARN  hdfs.DFSClient (DFSOutputStream.java:run(549)) - 
DataStreamer Exception
java.io.EOFException: Premature EOF: no length prefix available
at 
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
2012-04-24 22:06:10,072 WARN  hdfs.DFSClient 
(DFSOutputStream.java:hflush(1515)) - Error while syncing
java.io.EOFException: Premature EOF: no length prefix available
at 
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
java.io.EOFException: Premature EOF: no length prefix available
at 
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
{noformat}

 *DataNode Trace*  

{noformat}

2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - 
host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation  src: 
/127.0.0.1:49811 dest: /127.0.0.1:49744
java.io.IOException: 
BP-2001850558-xx.xx.xx.xx-1337249347060:blk_-8165642083860293107_1002 is 
neither a RBW nor a Finalized, r=ReplicaBeingWritten, 
blk_-8165642083860293107_1003, RBW
  getNumBytes() = 1024
  getBytesOnDisk()  = 1024
  getVisibleLength()= 1024
  getVolume()   = 
E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current
  getBlockFile()= 
E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-xx.xx.xx.xx-1337249347060\current\rbw\blk_-8165642083860293107
  bytesAcked=1024
  bytesOnDisk=102
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:114)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:78)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:189)
at java.lang.Thread.run(Unknown Source)
{noformat}

  was:
Scenario:
=

1. Cluster with 4 DataNodes.
2. Wrote a file to 3 DNs: DN1->DN2->DN3.
3. Stopped DN3.
Now appending to the file fails because addDatanode2ExistingPipeline fails.

 *Client Trace* 
{noformat}
2012-04-24 22:06:09,947 INFO  hdfs.DFSClient 
(DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in 
createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as **

[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277715#comment-13277715
 ] 

Hudson commented on HDFS-3433:
--

Integrated in Hadoop-Common-trunk-Commit #2258 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2258/])
HDFS-3433. GetImageServlet should allow administrative requestors when 
security is enabled. Contributed by Aaron T. Myers. (Revision 1339540)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540
Files : 
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java


> GetImageServlet should allow administrative requestors when security is 
> enabled
> ---
>
> Key: HDFS-3433
> URL: https://issues.apache.org/jira/browse/HDFS-3433
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
> Fix For: 2.0.1
>
> Attachments: HDFS-3433.patch
>
>
> Currently the GetImageServlet only allows the NN and checkpointing nodes to 
> connect. Since we now have the fetchImage command in DFSAdmin, we should 
> allow administrative requestors as well.
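As a hedged illustration (not part of this patch), once administrative requestors are allowed, an admin can pull the fsimage with the dfsadmin fetchImage command mentioned in the description. A minimal programmatic sketch, assuming the standard ToolRunner/DFSAdmin entry points; the local target directory is illustrative:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;
import org.apache.hadoop.util.ToolRunner;

public class FetchImageExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Downloads the most recent fsimage from the NameNode's image servlet into the
    // given local directory; with this change, a configured HDFS administrator may
    // make the request even when security is enabled.
    int rc = ToolRunner.run(conf, new DFSAdmin(),
        new String[] { "-fetchImage", "/tmp/fsimage-backup" });
    System.exit(rc);
  }
}
{noformat}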

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (HDFS-3436) Append to file is failing when one of the datanode where the block present is down.

2012-05-17 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay reassigned HDFS-3436:
---

Assignee: Vinay

> Append to file is failing when one of the datanode where the block present is 
> down.
> ---
>
> Key: HDFS-3436
> URL: https://issues.apache.org/jira/browse/HDFS-3436
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 2.0.0
>Reporter: Brahma Reddy Battula
>Assignee: Vinay
>
> Scenario:
> =
> 1. Cluster with 4 DataNodes.
> 2. Wrote a file to 3 DNs: DN1->DN2->DN3.
> 3. Stopped DN3.
> Now appending to the file fails because addDatanode2ExistingPipeline fails.
>  *Client Trace* 
> {noformat}
> 2012-04-24 22:06:09,947 INFO  hdfs.DFSClient 
> (DFSOutputStream.java:createBlockOutputStream(1063)) - Exception in 
> createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as ***:50010
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1053)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:943)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:09,947 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:setupPipelineForAppendOrRecovery(916)) - Error Recovery 
> for block BP-1023239-10.18.40.233-1335275282109:blk_296651611851855249_1253 
> in pipeline *:50010, **:50010, *:50010: bad datanode **:50010
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient (DFSOutputStream.java:run(549)) 
> - DataStreamer Exception
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> 2012-04-24 22:06:10,072 WARN  hdfs.DFSClient 
> (DFSOutputStream.java:hflush(1515)) - Error while syncing
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> java.io.EOFException: Premature EOF: no length prefix available
>   at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:866)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)
> {noformat}
>  *DataNode Trace*  
> {noformat}
> 2012-05-17 15:39:12,261 ERROR datanode.DataNode (DataXceiver.java:run(193)) - 
> host0.foo.com:49744:DataXceiver error processing TRANSFER_BLOCK operation  
> src: /127.0.0.1:49811 dest: /127.0.0.1:49744
> java.io.IOException: 
> BP-2001850558-10.18.47.190-1337249347060:blk_-8165642083860293107_1002 is 
> neither a RBW nor a Finalized, r=ReplicaBeingWritten, 
> blk_-8165642083860293107_1003, RBW
>   getNumBytes() = 1024
>   getBytesOnDisk()  = 1024
>   getVisibleLength()= 1024
>   getVolume()   = 
> E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current
>   getBlockFile()= 
> E:\MyWorkSpace\branch-2\Test\build\test\data\dfs\data\data1\current\BP-2001850558-10.18.47.190-1337249347060\current\rbw\blk_-8165642083860293107
>   bytesAcked=1024
>   bytesOnDisk=102
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2038)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:525)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:114)
>   at 
> org.apache.hadoop.hdfs.protocol.d

[jira] [Commented] (HDFS-3433) GetImageServlet should allow administrative requestors when security is enabled

2012-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277713#comment-13277713
 ] 

Hudson commented on HDFS-3433:
--

Integrated in Hadoop-Hdfs-trunk-Commit #2332 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2332/])
HDFS-3433. GetImageServlet should allow administrative requestors when 
security is enabled. Contributed by Aaron T. Myers. (Revision 1339540)

 Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1339540
Files : 
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/http/HttpServer.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GetImageServlet.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestGetImageServlet.java


> GetImageServlet should allow administrative requestors when security is 
> enabled
> ---
>
> Key: HDFS-3433
> URL: https://issues.apache.org/jira/browse/HDFS-3433
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
> Fix For: 2.0.1
>
> Attachments: HDFS-3433.patch
>
>
> Currently the GetImageServlet only allows the NN and checkpointing nodes to 
> connect. Since we now have the fetchImage command in DFSAdmin, we should 
> allow administrative requestors as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



