[jira] [Commented] (HDFS-3107) HDFS truncate

2014-10-20 Thread Plamen Jeliazkov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178044#comment-14178044
 ] 

Plamen Jeliazkov commented on HDFS-3107:


[~srivas], 
There is no plan to grow the file by padding it with zeroes as general-purpose 
truncate does. Both [~shv] and [~lei_chang] mentioned this in their design 
docs, I believe.

[~cmccabe],
While copying the last block up to its truncate point and doing a 
delete/concat is definitely a simpler overall approach, the full truncate 
implementation has the benefit of being a single NameNode RPC call that can 
both truncate in-place and copy-on-truncate, preserving the original last block 
and moving the 'copy&truncate' work to the DataNodes themselves (as opposed to 
having to pass data through the network / client). I am not intending to debate 
either implementation -- I like both personally; just wanted to explain as 
briefly as I could why Konstantin and I are taking our approach.
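
To make that contrast concrete, here is a rough, hypothetical sketch (not taken 
from either design doc) of what a purely client-side "copy & truncate" looks 
like with the public FileSystem API; every retained byte has to flow through 
the client, which the NameNode/DataNode-driven approach avoids:
{code}
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ClientSideTruncate {
  // Illustrative only: rewrite the file, keeping just the first newLength bytes.
  public static void truncate(FileSystem fs, Path src, long newLength)
      throws Exception {
    Path tmp = new Path(src.getParent(), src.getName() + ".truncating");
    try (FSDataInputStream in = fs.open(src);
         FSDataOutputStream out = fs.create(tmp, true)) {
      IOUtils.copyBytes(in, out, newLength, false);  // copy only the retained prefix
    }
    fs.delete(src, false);
    fs.rename(tmp, src);  // swap in the shorter file (not atomic; sketch only)
  }
}
{code}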

> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Lei Chang
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, 
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
> editsStored, editsStored, editsStored.xml
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard POSIX operation), the reverse operation of 
> append, which forces upper-layer applications to use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7254) Add documents for hot swap drive

2014-10-20 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178009#comment-14178009
 ] 

Fengdong Yu commented on HDFS-7254:
---

bq.<<>>

should be dfs.datanode.data.dir

> Add documents for hot swap drive
> 
>
> Key: HDFS-7254
> URL: https://issues.apache.org/jira/browse/HDFS-7254
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 2.5.1
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch
>
>
> Add documents for the hot swap drive functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7056) Snapshot support for truncate

2014-10-20 Thread Guo Ruijing (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guo Ruijing updated HDFS-7056:
--
Attachment: HDFSSnapshotWithTruncateDesign.docx

Attached HDFS Snapshot With Truncate Design for reference/review. 

> Snapshot support for truncate
> -
>
> Key: HDFS-7056
> URL: https://issues.apache.org/jira/browse/HDFS-7056
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Konstantin Shvachko
> Attachments: HDFSSnapshotWithTruncateDesign.docx
>
>
> Implementation of truncate in HDFS-3107 does not allow truncating files which 
> are in a snapshot. It is desirable to be able to truncate and still keep the 
> old file state of the file in the snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7225) Failed DataNode lookup can crash NameNode with NullPointerException

2014-10-20 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177948#comment-14177948
 ] 

Zhe Zhang commented on HDFS-7225:
-

[~andrew.wang] Thanks for the suggestion. I think that's a good idea. It 
assumes that the NN will make the same decision to invalidate those blocks when 
the volume is back. I think it's a valid assumption. I'll implement that option.
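
For context, a minimal standalone illustration (hypothetical names, not code 
from the patch) of why the null key crashes the lookup: a {{TreeMap}} using 
natural ordering throws {{NullPointerException}} as soon as it is asked to 
handle a null key.
{code}
import java.util.TreeMap;

public class NullKeyLookup {
  public static void main(String[] args) {
    TreeMap<String, String> map = new TreeMap<>();
    map.put("datanode-1", "blocks to invalidate");
    // Analogous to passing a null DatanodeDescriptor to invalidateWork:
    // TreeMap rejects null keys under natural ordering.
    map.get(null);  // throws NullPointerException
  }
}
{code}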

> Failed DataNode lookup can crash NameNode with NullPointerException
> ---
>
> Key: HDFS-7225
> URL: https://issues.apache.org/jira/browse/HDFS-7225
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HDFS-7225-v1.patch
>
>
> {{BlockManager#invalidateWorkForOneNode}} looks up a DataNode by the 
> {{datanodeUuid}} and passes the resultant {{DatanodeDescriptor}} to 
> {{InvalidateBlocks#invalidateWork}}. However, if a wrong or outdated 
> {{datanodeUuid}} is used, a null pointer will be passed to {{invalidateWork}} 
> which will use it to look up an entry in a {{TreeMap}}. Since the key type is 
> {{DatanodeDescriptor}}, key comparison is based on the IP address. A null key 
> will crash the NameNode with an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone

2014-10-20 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177941#comment-14177941
 ] 

Yi Liu commented on HDFS-7243:
--

Hi Charles, I was going to commit this patch just now, but found another issue, 
sorry for missing that in previous comments.
{code}
dir.getINodesInPath4Write(target, true);
{code}
we should call
{code}
dir.getINodesInPath4Write(target);
{code}
since the latter holds the FsDir read lock.

Besides, another small nit in the test: 
{code}
fs.concat(new Path(ez, "target"), new Path[] { src1, src2 });
{code}
We could use {{target}} instead of {{new Path(ez, "target")}} 

> HDFS concat operation should not be allowed in Encryption Zone
> --
>
> Key: HDFS-7243
> URL: https://issues.apache.org/jira/browse/HDFS-7243
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.6.0
>Reporter: Yi Liu
>Assignee: Charles Lamb
> Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch, 
> HDFS-7243.003.patch
>
>
> For HDFS encryption at rest, files in an encryption zone are using different 
> data encryption keys, so concat should be disallowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177942#comment-14177942
 ] 

Hadoop QA commented on HDFS-3342:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676008/HDFS-3342.002.patch
  against trunk revision 7aab5fa.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestDecommission
  
org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8467//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8467//console

This message is automatically generated.

> SocketTimeoutException in BlockSender.sendChunks could have a better error 
> message
> --
>
> Key: HDFS-3342
> URL: https://issues.apache.org/jira/browse/HDFS-3342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Todd Lipcon
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch, 
> HDFS-3342.002.patch
>
>
> Currently, if a client connects to a DN and begins to read a block, but then 
> stops calling read() for a long period of time, the DN will log a 
> SocketTimeoutException "48 millis timeout while waiting for channel to be 
> ready for write." This is because there is no "keepalive" functionality of 
> any kind. At a minimum, we should improve this error message to be an INFO 
> level log which just says that the client likely stopped reading, so 
> disconnecting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7269) NN and DN don't check whether corrupted blocks reported by clients are actually corrupted

2014-10-20 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177905#comment-14177905
 ] 

Ming Ma commented on HDFS-7269:
---

Nicholas, in our case, the client only reported one replica for each 
reportBadBlocks call. But given there were multiple DFSInputStream read calls 
for a given block and each read call could mark one replica bad, all replicas 
were marked as bad.

> NN and DN don't check whether corrupted blocks reported by clients are 
> actually corrupted
> -
>
> Key: HDFS-7269
> URL: https://issues.apache.org/jira/browse/HDFS-7269
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>
> We had a case where the client machine had a memory issue and thus failed the 
> checksum validation of a given block for all its replicas. So the client 
> ended up informing NN about the corrupted blocks for all DNs via 
> reportBadBlocks. However, the block isn't corrupted on any of the DNs. You 
> can still use DFSClient to read the block. But in order to get rid of NN's 
> warning message for corrupt block, we either do a NN fail over, or repair the 
> file via a) copy the file somewhere, b) remove the file, c) copy the file 
> back.
> It will be useful if the NN and DN can validate the client's report. In fact, there 
> is a comment in NamenodeRpcServer about this.
> {noformat}
>   /**
>* The client has detected an error on the specified located blocks 
>* and is reporting them to the server.  For now, the namenode will 
>* mark the block as corrupt.  In the future we might 
>* check the blocks are actually corrupt. 
>*/
> {noformat}
> To allow the system to recover from an invalid client report quickly, we can 
> support automatic recovery or a manual admin command.
> 1. We can have the NN send a new DatanodeCommand like ValidateBlockCommand. The 
> DN will notify the validation result via IBR and a new 
> ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK.
> 2. Add a new admin command to move corrupted blocks out of the BM's 
> CorruptReplicasMap and UnderReplicatedBlocks.
> Appreciate any input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7271) Find a way to make encryption zone deletion work with HDFS trash.

2014-10-20 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu resolved HDFS-7271.
--
Resolution: Invalid

The "-skipTrash" already exists for rm op, so resolve it as invalid.

> Find a way to make encryption zone deletion work with HDFS trash.
> -
>
> Key: HDFS-7271
> URL: https://issues.apache.org/jira/browse/HDFS-7271
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption
>Affects Versions: 2.6.0
>Reporter: Yi Liu
>Assignee: Yi Liu
>
> Currently when HDFS trash is enabled, deletion of an encryption zone will have 
> an issue:
> {quote}
> rmr: Failed to move to trash: ... can't be moved from an encryption zone.
> {quote}
> A simple way is to add an ignore-trash flag for the fs rm operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3107) HDFS truncate

2014-10-20 Thread M. C. Srivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177900#comment-14177900
 ] 

M. C. Srivas commented on HDFS-3107:


Note that a general-purpose truncate can also be used to *increase* the size of 
the file.  It is used very often, for example, to implement a database, growing 
the file if it isn't large enough.  Are you planning to implement truncate to 
behave that way too?
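
For reference, a minimal local-filesystem sketch (plain Java against the local 
filesystem, not HDFS; illustrative only) of truncate-style growth. POSIX 
ftruncate zero-fills the extended region; {{RandomAccessFile#setLength}} behaves 
the same way on common filesystems, though the Java API formally leaves the 
extended contents unspecified:
{code}
import java.io.RandomAccessFile;

public class GrowViaTruncate {
  public static void main(String[] args) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile("demo.dat", "rw")) {
      raf.write(new byte[] {1, 2, 3});  // file is now 3 bytes long
      raf.setLength(1024);              // "truncate" to a LARGER size: the file grows
      System.out.println("length = " + raf.length());   // prints 1024
      raf.seek(100);
      System.out.println("byte[100] = " + raf.read());  // 0 on typical filesystems
    }
  }
}
{code}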


> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Lei Chang
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, 
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
> editsStored, editsStored, editsStored.xml
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard POSIX operation), the reverse operation of 
> append, which forces upper-layer applications to use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7271) Find a way to make encryption zone deletion work with HDFS trash.

2014-10-20 Thread Yi Liu (JIRA)
Yi Liu created HDFS-7271:


 Summary: Find a way to make encryption zone deletion work with 
HDFS trash.
 Key: HDFS-7271
 URL: https://issues.apache.org/jira/browse/HDFS-7271
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: encryption
Affects Versions: 2.6.0
Reporter: Yi Liu
Assignee: Yi Liu


Currently when HDFS trash is enabled, deletion of an encryption zone will have 
an issue:
{quote}
rmr: Failed to move to trash: ... can't be moved from an encryption zone.
{quote}
A simple way is to add an ignore-trash flag for the fs rm operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7254) Add documents for hot swap drive

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177888#comment-14177888
 ] 

Hadoop QA commented on HDFS-7254:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675986/HDFS-7254.001.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing
  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.tools.TestDFSAdminWithHA

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8465//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8465//console

This message is automatically generated.

> Add documents for hot swap drive
> 
>
> Key: HDFS-7254
> URL: https://issues.apache.org/jira/browse/HDFS-7254
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 2.5.1
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch
>
>
> Add documents for the hot swap drive functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177885#comment-14177885
 ] 

Gopal V commented on HDFS-7266:
---

That was quick! Thanks [~cmccabe] & [~andrew.wang].

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>Priority: Minor
>  Labels: multi-threading
> Fix For: 2.7.0
>
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> Since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3107) HDFS truncate

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3107:
---
Attachment: (was: HDFS-3107.008.patch)

> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Lei Chang
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, 
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
> editsStored, editsStored, editsStored.xml
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard POSIX operation), the reverse operation of 
> append, which forces upper-layer applications to use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3107) HDFS truncate

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3107:
---
Attachment: HDFS-3107.008.patch

Fix a log message which should be at trace level, not info.

> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Lei Chang
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107.008.patch, HDFS-3107.008.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS_truncate.pdf, 
> HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, 
> HDFS_truncate_semantics_Mar21.pdf, editsStored, editsStored, editsStored.xml
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard POSIX operation), the reverse operation of 
> append, which forces upper-layer applications to use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3107) HDFS truncate

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-3107:
---
Attachment: HDFS-3107.008.patch

Hi all,

Here's a patch which implements truncate in such a way that it works with 
snapshots.

This doesn't modify the last replica file of the truncated file in place.  
Instead, it writes out a new file with the new (shorter) contents of the last 
replica file, and uses concat to combine it with the first part of the file.

> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Lei Chang
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107.008.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate.pdf, 
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
> editsStored, editsStored, editsStored.xml
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard POSIX operation), the reverse operation of 
> append, which forces upper-layer applications to use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk

2014-10-20 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177853#comment-14177853
 ] 

Yongjun Zhang commented on HDFS-7235:
-

HI [~cmccabe],

Thanks again for the review. Please see my answer below.
{quote}
We shouldn't log a message saying that "the block file doesn't exist" if the 
block file exists, but is not finalized.
{quote}
We are not; we only log when the state is FINALIZED and the block file doesn't 
exist. 

{quote}
I also don't see why we need to call FSDatasetSpi#getLength, if we already have 
access to the replica length here.
{quote}
The new fix we are introducing here handles the special case where 
{{isValidBlock()}} returns false, so I tried to limit the change to the special 
handling block. If we remove the pre-existing {{FSDatasetSpi#getLength}}, we 
need to move the {{getReplica()}} call out of the false block.
{{getReplica()}} is marked {{@Deprecated}}, so I consider calling it a bit of a 
hack here already. Plus, we need to synchronize the whole block of code, so I 
hope we can limit the impact to within the false block. I wonder if this 
explanation makes sense to you.

{quote}
I would suggest having your synchronized section set a string named 
replicaProblem. Then if the string is null at the end, there is no problem-- 
otherwise, the problem is contained in replicaProblem. Then you can check 
existence, replica state, and length all at once.
{quote}
I am not sure I follow what you said; I will check in person.
{quote}
We don't even need to call isValidBlock. getReplica gives you all the info you 
need. Please take out this call, since it's unnecessary.
{quote}
{{isValidBlock}} is declared in the FsDatasetSpi interface and implemented in 
derived classes such as FsDatasetImpl and SimulatedFSDataset, which might have 
different implementations of the method. It'd be nice to stick to the 
FsDatasetSpi interface. 

Will discuss with you more.

Thanks again.



> Can not decommission DN which has invalid block due to bad disk
> ---
>
> Key: HDFS-7235
> URL: https://issues.apache.org/jira/browse/HDFS-7235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, 
> HDFS-7235.003.patch
>
>
> When decommissioning a DN, the process hangs. 
> What happens is, when the NN chooses a replica as a source to replicate the 
> data on the to-be-decommissioned DN to other DNs, it favors choosing the 
> to-be-decommissioned DN itself as the source of the transfer (see BlockManager.java).  
> However, because of the bad disk, the DN would detect the source block to be 
> transferred as an invalidBlock with the following logic in FsDatasetImpl.java:
> {code}
> /** Does the block exist and have the given state? */
>   private boolean isValid(final ExtendedBlock b, final ReplicaState state) {
> final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), 
> b.getLocalBlock());
> return replicaInfo != null
> && replicaInfo.getState() == state
> && replicaInfo.getBlockFile().exists();
>   }
> {code}
> The reason that this method returns false (detecting invalid block) is 
> because the block file doesn't exist due to bad disk in this case. 
> The key issue we found here is, after DN detects an invalid block for the 
> above reason, it doesn't report the invalid block back to NN, thus NN doesn't 
> know that the block is corrupted, and keeps sending the data transfer request 
> to the same DN to be decommissioned, again and again. This caused an infinite 
> loop, so the decommission process hangs.
> Thanks [~qwertymaniac] for reporting the issue and initial analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177828#comment-14177828
 ] 

Hadoop QA commented on HDFS-7235:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675964/HDFS-7235.003.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1304 javac 
compiler warnings (more than the trunk's current 1293 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.datanode.TestRefreshNamenodes
  org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing
  org.apache.hadoop.hdfs.server.namenode.ha.TestHAFsck
  
org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.fs.TestSymlinkHdfsFileSystem

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8462//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8462//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8462//console

This message is automatically generated.

> Can not decommission DN which has invalid block due to bad disk
> ---
>
> Key: HDFS-7235
> URL: https://issues.apache.org/jira/browse/HDFS-7235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, 
> HDFS-7235.003.patch
>
>
> When decommissioning a DN, the process hangs. 
> What happens is, when the NN chooses a replica as a source to replicate the 
> data on the to-be-decommissioned DN to other DNs, it favors choosing the 
> to-be-decommissioned DN itself as the source of the transfer (see BlockManager.java).  
> However, because of the bad disk, the DN would detect the source block to be 
> transferred as an invalidBlock with the following logic in FsDatasetImpl.java:
> {code}
> /** Does the block exist and have the given state? */
>   private boolean isValid(final ExtendedBlock b, final ReplicaState state) {
> final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), 
> b.getLocalBlock());
> return replicaInfo != null
> && replicaInfo.getState() == state
> && replicaInfo.getBlockFile().exists();
>   }
> {code}
> The reason that this method returns false (detecting invalid block) is 
> because the block file doesn't exist due to bad disk in this case. 
> The key issue we found here is, after DN detects an invalid block for the 
> above reason, it doesn't report the invalid block back to NN, thus NN doesn't 
> know that the block is corrupted, and keeps sending the data transfer request 
> to the same DN to be decommissioned, again and again. This caused an infinite 
> loop, so the decommission process hangs.
> Thanks [~qwertymaniac] for reporting the issue and initial analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-5928) show namespace and namenode ID on NN dfshealth page

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177829#comment-14177829
 ] 

Hadoop QA commented on HDFS-5928:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675962/HDFS-5928.v4.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1304 javac 
compiler warnings (more than the trunk's current 1293 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing
  
org.apache.hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.fs.TestSymlinkHdfsFileSystem

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8461//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8461//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8461//console

This message is automatically generated.

> show namespace and namenode ID on NN dfshealth page
> ---
>
> Key: HDFS-5928
> URL: https://issues.apache.org/jira/browse/HDFS-5928
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siqi Li
>Assignee: Siqi Li
> Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, 
> HDFS-5928.v4.patch, HDFS-5928.v1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7180) NFSv3 gateway frequently gets stuck

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177825#comment-14177825
 ] 

Hadoop QA commented on HDFS-7180:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12676001/HDFS-7180.001.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs-nfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8466//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8466//console

This message is automatically generated.

> NFSv3 gateway frequently gets stuck
> ---
>
> Key: HDFS-7180
> URL: https://issues.apache.org/jira/browse/HDFS-7180
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.5.0
> Environment: Linux, Fedora 19 x86-64
>Reporter: Eric Zhiqiang Ma
>Assignee: Brandon Li
>Priority: Critical
> Attachments: HDFS-7180.001.patch
>
>
> We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway 
> on one node in the cluster to let users upload data with rsync.
> However, we find the NFSv3 daemon seems to get stuck frequently while HDFS 
> itself seems to be working well (hdfs dfs -ls etc. works just fine). The last 
> hang we found occurred after around 1 day of running and several hundred GBs 
> of data uploaded.
> The NFSv3 daemon is started on one node and on the same node the NFS is 
> mounted.
> From the node where the NFS is mounted:
> dmesg shows lines like this:
> [1859245.368108] nfs: server localhost not responding, still trying
> [1859245.368111] nfs: server localhost not responding, still trying
> [1859245.368115] nfs: server localhost not responding, still trying
> [1859245.368119] nfs: server localhost not responding, still trying
> [1859245.368123] nfs: server localhost not responding, still trying
> [1859245.368127] nfs: server localhost not responding, still trying
> [1859245.368131] nfs: server localhost not responding, still trying
> [1859245.368135] nfs: server localhost not responding, still trying
> [1859245.368138] nfs: server localhost not responding, still trying
> [1859245.368142] nfs: server localhost not responding, still trying
> [1859245.368146] nfs: server localhost not responding, still trying
> [1859245.368150] nfs: server localhost not responding, still trying
> [1859245.368153] nfs: server localhost not responding, still trying
> The mounted directory can not be `ls` and `df -hT` gets stuck too.
> The latest lines from the nfs3 log in the hadoop logs directory:
> 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> user map size: 35
> 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> group map size: 54
> 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:23,809 I

[jira] [Commented] (HDFS-7259) Unresponseive NFS mount point due to deferred COMMIT response

2014-10-20 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177821#comment-14177821
 ] 

Jing Zhao commented on HDFS-7259:
-

Thanks for working on this, Brandon! The patch looks good to me. +1.

> Unresponseive NFS mount point due to deferred COMMIT response
> -
>
> Key: HDFS-7259
> URL: https://issues.apache.org/jira/browse/HDFS-7259
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.2.0
>Reporter: Brandon Li
>Assignee: Brandon Li
> Attachments: HDFS-7259.001.patch, HDFS-7259.002.patch
>
>
> Since the gateway can't commit random writes, it caches the COMMIT requests in 
> a queue and sends back a response only when the data can be committed or the 
> stream times out (a failure in the latter case). This could cause problems in 
> two patterns:
> (1) file uploading failure 
> (2) the mount dir is stuck on the same client, but other NFS clients can 
> still access the NFS gateway.
> Error pattern (2) occurs because there are too many COMMIT requests pending, 
> so the NFS client, having hit its pending-request limit, can't send any other 
> requests (e.g., for "ls") to the NFS gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7154) Fix returning value of starting reconfiguration task

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7154:
---
   Resolution: Fixed
Fix Version/s: 2.6.0
   Status: Resolved  (was: Patch Available)

Committed.  Thanks, Eddy.

> Fix returning value of starting reconfiguration task
> 
>
> Key: HDFS-7154
> URL: https://issues.apache.org/jira/browse/HDFS-7154
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 3.0.0, 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Fix For: 2.6.0
>
> Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, 
> HDFS-7154.001.patch, HDFS-7154.001.patch
>
>
> Running {{hdfs dfsadmin -reconfig ... start}} mistakenly returns {{-1}} 
> (255). This is because {{DFSAdmin#startReconfiguration()}} returns the wrong 
> exit code; it is expected to return 0 to indicate success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7154) Fix returning value of starting reconfiguration task

2014-10-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177810#comment-14177810
 ] 

Hudson commented on HDFS-7154:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #6296 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6296/])
HDFS-7154. Fix returning value of starting reconfiguration task (Lei Xu via 
Colin P. McCabe) (cmccabe: rev 7aab5fa1bd9386b036af45cd8206622a4555d74a)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/tools/TestDFSAdmin.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSAdmin.java


> Fix returning value of starting reconfiguration task
> 
>
> Key: HDFS-7154
> URL: https://issues.apache.org/jira/browse/HDFS-7154
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 3.0.0, 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Fix For: 2.6.0
>
> Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, 
> HDFS-7154.001.patch, HDFS-7154.001.patch
>
>
> Running {{hdfs dfsadmin -reconfig ... start}} mistakenly returns {{-1}} 
> (255). This is because {{DFSAdmin#startReconfiguration()}} returns the wrong 
> exit code; it is expected to return 0 to indicate success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message

2014-10-20 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-3342:

Attachment: HDFS-3342.002.patch

The eclipse:eclipse build issue appears to be a glitch; uploading the same patch 
again to trigger another run.


> SocketTimeoutException in BlockSender.sendChunks could have a better error 
> message
> --
>
> Key: HDFS-3342
> URL: https://issues.apache.org/jira/browse/HDFS-3342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Todd Lipcon
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch, 
> HDFS-3342.002.patch
>
>
> Currently, if a client connects to a DN and begins to read a block, but then 
> stops calling read() for a long period of time, the DN will log a 
> SocketTimeoutException "48 millis timeout while waiting for channel to be 
> ready for write." This is because there is no "keepalive" functionality of 
> any kind. At a minimum, we should improve this error message to be an INFO 
> level log which just says that the client likely stopped reading, so 
> disconnecting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177795#comment-14177795
 ] 

Hudson commented on HDFS-7266:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6295 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6295/])
HDFS-7266. HDFS Peercache enabled check should not lock on object (awang via 
cmccabe) (cmccabe: rev 4799570dfdb7987c2ac39716143341e9a3d9b7d2)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/PeerCache.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>Priority: Minor
>  Labels: multi-threading
> Fix For: 2.7.0
>
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> Since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7254) Add documents for hot swap drive

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177793#comment-14177793
 ] 

Colin Patrick McCabe commented on HDFS-7254:


+1.  Thanks, Eddy.

> Add documents for hot swap drive
> 
>
> Key: HDFS-7254
> URL: https://issues.apache.org/jira/browse/HDFS-7254
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 2.5.1
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch
>
>
> Add documents for the hot swap drive functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177792#comment-14177792
 ] 

Colin Patrick McCabe commented on HDFS-7235:


{code}
boolean needToReportBadBlock = false;
synchronized (data) {
  ReplicaInfo replicaInfo = (ReplicaInfo) data.getReplica(
      block.getBlockPoolId(), block.getBlockId());
  needToReportBadBlock = (replicaInfo != null
      && replicaInfo.getState() == ReplicaState.FINALIZED
      && !replicaInfo.getBlockFile().exists());
}
if (needToReportBadBlock) {
  // Report back to NN bad block caused by non-existent block file.
  reportBadBlock(bpos, block, "Can't replicate block " + block
      + " because the block file doesn't exist");
} else {
  String errStr = "Can't send invalid block " + block;
  LOG.info(errStr);
  bpos.trySendErrorReport(DatanodeProtocol.INVALID_BLOCK, errStr);
}
{code}

We shouldn't log a message saying that "the block file doesn't exist" if the 
block file exists, but is not finalized.

I also don't see why we need to call {{FSDatasetSpi#getLength}}, if we already 
have access to the replica length here.

I would suggest having your synchronized section set a string named 
{{replicaProblem}}.  Then if the string is null at the end, there is no 
problem-- otherwise, the problem is contained in {{replicaProblem}}.  Then you 
can check existence, replica state, and length all at once.

bq. BTW, about the WATCH-OUT, I was just thinking that someone could add 
another condition in the FsDatasetImpl#isValidBlock that makes the method to 
return false. But that's remote and probably won't happen.

We don't even need to call {{isValidBlock}}.  {{getReplica}} gives you all the 
info you need.  Please take out this call, since it's unnecessary.
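
A rough sketch of that suggestion (hypothetical; it reuses the names from the 
snippet above and assumes the surrounding DataNode transfer code, so it is a 
fragment rather than a drop-in patch):
{code}
String replicaProblem = null;
synchronized (data) {
  ReplicaInfo replicaInfo = (ReplicaInfo) data.getReplica(
      block.getBlockPoolId(), block.getBlockId());
  if (replicaInfo == null) {
    replicaProblem = "replica does not exist";
  } else if (replicaInfo.getState() != ReplicaState.FINALIZED) {
    replicaProblem = "replica is in state " + replicaInfo.getState();
  } else if (!replicaInfo.getBlockFile().exists()) {
    replicaProblem = "block file " + replicaInfo.getBlockFile() + " does not exist";
  } else if (replicaInfo.getNumBytes() < block.getNumBytes()) {
    replicaProblem = "replica is shorter than expected: "
        + replicaInfo.getNumBytes() + " < " + block.getNumBytes();
  }
}
if (replicaProblem != null) {
  // Report back to the NN instead of silently retrying the transfer forever.
  reportBadBlock(bpos, block, "Can't replicate block " + block
      + " because " + replicaProblem);
}
{code}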

> Can not decommission DN which has invalid block due to bad disk
> ---
>
> Key: HDFS-7235
> URL: https://issues.apache.org/jira/browse/HDFS-7235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, 
> HDFS-7235.003.patch
>
>
> When decommissioning a DN, the process hangs. 
> What happens is, when the NN chooses a replica as a source to replicate the 
> data on the to-be-decommissioned DN to other DNs, it favors choosing the 
> to-be-decommissioned DN itself as the source of the transfer (see BlockManager.java).  
> However, because of the bad disk, the DN would detect the source block to be 
> transferred as an invalidBlock with the following logic in FsDatasetImpl.java:
> {code}
> /** Does the block exist and have the given state? */
>   private boolean isValid(final ExtendedBlock b, final ReplicaState state) {
> final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), 
> b.getLocalBlock());
> return replicaInfo != null
> && replicaInfo.getState() == state
> && replicaInfo.getBlockFile().exists();
>   }
> {code}
> The reason that this method returns false (detecting invalid block) is 
> because the block file doesn't exist due to bad disk in this case. 
> The key issue we found here is, after DN detects an invalid block for the 
> above reason, it doesn't report the invalid block back to NN, thus NN doesn't 
> know that the block is corrupted, and keeps sending the data transfer request 
> to the same DN to be decommissioned, again and again. This caused an infinite 
> loop, so the decommission process hangs.
> Thanks [~qwertymaniac] for reporting the issue and initial analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7269) NN and DN don't check whether corrupted blocks reported by clients are actually corrupted

2014-10-20 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177789#comment-14177789
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7269:
---

By HDFS-1371, the client should not report checksum failure when all the nodes 
are bad.  Do the files have only one replica in your case?

> NN and DN don't check whether corrupted blocks reported by clients are 
> actually corrupted
> -
>
> Key: HDFS-7269
> URL: https://issues.apache.org/jira/browse/HDFS-7269
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>
> We had a case where the client machine had a memory issue and thus failed the 
> checksum validation of a given block for all its replicas. So the client 
> ended up informing NN about the corrupted blocks for all DNs via 
> reportBadBlocks. However, the block isn't corrupted on any of the DNs. You 
> can still use DFSClient to read the block. But in order to get rid of NN's 
> warning message for corrupt block, we either do a NN fail over, or repair the 
> file via a) copy the file somewhere, b) remove the file, c) copy the file 
> back.
> It will be useful if the NN and DN can validate the client's report. In fact, there 
> is a comment in NamenodeRpcServer about this.
> {noformat}
>   /**
>* The client has detected an error on the specified located blocks 
>* and is reporting them to the server.  For now, the namenode will 
>* mark the block as corrupt.  In the future we might 
>* check the blocks are actually corrupt. 
>*/
> {noformat}
> To allow the system to recover from an invalid client report quickly, we can 
> support automatic recovery or a manual admin command.
> 1. We can have the NN send a new DatanodeCommand like ValidateBlockCommand. The 
> DN will notify the validation result via IBR and a new 
> ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK.
> 2. Add a new admin command to move corrupted blocks out of the BM's 
> CorruptReplicasMap and UnderReplicatedBlocks.
> Appreciate any input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7266:
---
  Resolution: Fixed
   Fix Version/s: 2.7.0
Target Version/s: 2.7.0
  Status: Resolved  (was: Patch Available)

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>Priority: Minor
>  Labels: multi-threading
> Fix For: 2.7.0
>
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> Since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1418#comment-1418
 ] 

Colin Patrick McCabe commented on HDFS-7266:


+1. Test failures look like HDFS-7226, not related.  No new tests are needed 
because this is a small change to locking which is covered by the previous 
PeerCache tests.  Will commit momentarily.  Thanks Andrew and Gopal!
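
For reference, a minimal standalone sketch (illustrative names, not the 
committed patch) of the locking pattern involved: reading a final field needs 
no lock, so the "cache disabled" check can happen before entering the 
synchronized section.
{code}
class PeerCacheSketch {
  private final int capacity;  // immutable after construction

  PeerCacheSketch(int capacity) {
    this.capacity = capacity;
  }

  Object get(Object dnId, boolean isDomain) {
    if (capacity <= 0) {
      return null;  // cache disabled: no need to contend on the monitor
    }
    synchronized (this) {
      // ... the actual cache lookup would happen here, under the lock ...
      return null;
    }
  }
}
{code}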

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>Priority: Minor
>  Labels: multi-threading
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> Since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7215) Add JvmPauseMonitor to NFS gateway

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1413#comment-1413
 ] 

Colin Patrick McCabe commented on HDFS-7215:


+1 for the current patch.  Will commit tomorrow if nobody has any more comments.

> Add JvmPauseMonitor to NFS gateway
> --
>
> Key: HDFS-7215
> URL: https://issues.apache.org/jira/browse/HDFS-7215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: nfs
>Affects Versions: 2.2.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Minor
> Attachments: HDFS-7215.001.patch
>
>
> Like NN/DN, a GC log would help debug issues in NFS gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7180) NFSv3 gateway frequently gets stuck

2014-10-20 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-7180:
-
Attachment: HDFS-7180.001.patch

> NFSv3 gateway frequently gets stuck
> ---
>
> Key: HDFS-7180
> URL: https://issues.apache.org/jira/browse/HDFS-7180
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.5.0
> Environment: Linux, Fedora 19 x86-64
>Reporter: Eric Zhiqiang Ma
>Assignee: Brandon Li
>Priority: Critical
> Attachments: HDFS-7180.001.patch
>
>
> We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway 
> on one node in the cluster to let users upload data with rsync.
> However, we find the NFSv3 daemon frequently gets stuck while HDFS itself 
> seems to be working well (hdfs dfs -ls etc. work just fine). The most recent 
> hang we saw was after around 1 day of running and several hundred GBs of data 
> uploaded.
> The NFSv3 daemon is started on one node and on the same node the NFS is 
> mounted.
> From the node where the NFS is mounted:
> dmsg shows like this:
> [1859245.368108] nfs: server localhost not responding, still trying
> [1859245.368111] nfs: server localhost not responding, still trying
> [1859245.368115] nfs: server localhost not responding, still trying
> [1859245.368119] nfs: server localhost not responding, still trying
> [1859245.368123] nfs: server localhost not responding, still trying
> [1859245.368127] nfs: server localhost not responding, still trying
> [1859245.368131] nfs: server localhost not responding, still trying
> [1859245.368135] nfs: server localhost not responding, still trying
> [1859245.368138] nfs: server localhost not responding, still trying
> [1859245.368142] nfs: server localhost not responding, still trying
> [1859245.368146] nfs: server localhost not responding, still trying
> [1859245.368150] nfs: server localhost not responding, still trying
> [1859245.368153] nfs: server localhost not responding, still trying
> The mounted directory cannot be listed with `ls`, and `df -hT` gets stuck too.
> The latest lines from the nfs3 log in the hadoop logs directory:
> 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> user map size: 35
> 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> group map size: 54
> 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update 
> cache now
> 2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not 
> doing static UID/GID mapping because '/etc/nfs.map' does not exist.
> 2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> user map size: 35
> 2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> group map size: 54
> 2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow 
> ReadProcessor read fields took 60062ms (threshold=3ms); ack: seqno: -2 
> status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: 
> [10.0.3.172:50010, 10.0.3.176:50010]
> 2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: 
> DFSOutputStream ResponseProcessor exception  for block 
> BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
> java.io.IOException: Bad response ERROR for block 
> BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 fr

[jira] [Updated] (HDFS-7180) NFSv3 gateway frequently gets stuck

2014-10-20 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-7180:
-
Status: Patch Available  (was: Open)

> NFSv3 gateway frequently gets stuck
> ---
>
> Key: HDFS-7180
> URL: https://issues.apache.org/jira/browse/HDFS-7180
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.5.0
> Environment: Linux, Fedora 19 x86-64
>Reporter: Eric Zhiqiang Ma
>Assignee: Brandon Li
>Priority: Critical
> Attachments: HDFS-7180.001.patch
>
>
> We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway 
> on one node in the cluster to let users upload data with rsync.
> However, we find the NFSv3 daemon frequently gets stuck while HDFS itself 
> seems to be working well (hdfs dfs -ls etc. work just fine). The most recent 
> hang we saw was after around 1 day of running and several hundred GBs of data 
> uploaded.
> The NFSv3 daemon is started on one node and on the same node the NFS is 
> mounted.
> From the node where the NFS is mounted:
> dmsg shows like this:
> [1859245.368108] nfs: server localhost not responding, still trying
> [1859245.368111] nfs: server localhost not responding, still trying
> [1859245.368115] nfs: server localhost not responding, still trying
> [1859245.368119] nfs: server localhost not responding, still trying
> [1859245.368123] nfs: server localhost not responding, still trying
> [1859245.368127] nfs: server localhost not responding, still trying
> [1859245.368131] nfs: server localhost not responding, still trying
> [1859245.368135] nfs: server localhost not responding, still trying
> [1859245.368138] nfs: server localhost not responding, still trying
> [1859245.368142] nfs: server localhost not responding, still trying
> [1859245.368146] nfs: server localhost not responding, still trying
> [1859245.368150] nfs: server localhost not responding, still trying
> [1859245.368153] nfs: server localhost not responding, still trying
> The mounted directory cannot be listed with `ls`, and `df -hT` gets stuck too.
> The latest lines from the nfs3 log in the hadoop logs directory:
> 2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> user map size: 35
> 2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> group map size: 54
> 2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: 
> Have to change stable write to unstable write:FILE_SYNC
> 2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update 
> cache now
> 2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not 
> doing static UID/GID mapping because '/etc/nfs.map' does not exist.
> 2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> user map size: 35
> 2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated 
> group map size: 54
> 2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow 
> ReadProcessor read fields took 60062ms (threshold=3ms); ack: seqno: -2 
> status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: 
> [10.0.3.172:50010, 10.0.3.176:50010]
> 2014-10-02 06:01:32,099 WARN org.apache.hadoop.hdfs.DFSClient: 
> DFSOutputStream ResponseProcessor exception  for block 
> BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
> java.io.IOException: Bad response ERROR for block 
> BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_6236

[jira] [Resolved] (HDFS-5131) Need a DEFAULT-like pipeline recovery policy that works for writers that flush

2014-10-20 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze resolved HDFS-5131.
---
Resolution: Duplicate

Resolving this as a duplicate of HDFS-4257.

> Need a DEFAULT-like pipeline recovery policy that works for writers that flush
> --
>
> Key: HDFS-5131
> URL: https://issues.apache.org/jira/browse/HDFS-5131
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.0.6-alpha
>Reporter: Mike Percy
>Assignee: Tsz Wo Nicholas Sze
>
> The Hadoop 2 pipeline-recovery mechanism currently has four policies: DISABLE 
> (never do recovery), NEVER (never do recovery unless client asks for it), 
> ALWAYS (block until we have recovered the write pipeline to minimum 
> replication levels), and DEFAULT (try to do ALWAYS, but use a heuristic to 
> "give up" and allow writers to continue if not enough datanodes are available 
> to recover the pipeline).
> The big problem with DEFAULT is that it specifically falls back to ALWAYS 
> behavior if a client calls hflush(). On its face, this seems like a reasonable 
> thing to do, but in practice it means that clients like Flume (as well as, 
> I assume, HBase) simply block when the cluster is low on datanodes.
> In order to work around this issue, the easiest thing to do today is set the 
> policy to NEVER when using Flume to write to the cluster. But obviously 
> that's not ideal.
> I believe what clients like Flume need is an additional policy which 
> essentially uses the heuristic logic used by DEFAULT even in cases where 
> long-lived writers call hflush().
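
For context, the NEVER workaround mentioned above is selected through the standard 
client setting {{dfs.client.block.write.replace-datanode-on-failure.policy}}; a 
minimal sketch (the values shown are the existing policies, not the proposed new one):
{code}
// Sketch of today's workaround, not a new policy: a long-lived writer such as
// Flume opts out of datanode replacement during pipeline recovery entirely.
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryPolicySketch {
  public static Configuration flumeLikeClientConf() {
    Configuration conf = new Configuration();
    // NEVER: do not ask to replace failed datanodes in the write pipeline,
    // even for writers that call hflush(). Obviously not ideal, as noted above.
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    return conf;
  }
}
{code}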



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7270) Implementing congestion control in writing pipeline

2014-10-20 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated HDFS-7270:
--
Component/s: datanode
 Issue Type: Improvement  (was: Bug)

> Implementing congestion control in writing pipeline
> ---
>
> Key: HDFS-7270
> URL: https://issues.apache.org/jira/browse/HDFS-7270
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haohui Mai
>Assignee: Haohui Mai
>
> When a client writes to HDFS faster than the disk bandwidth of the DNs, it  
> saturates the disk bandwidth and renders the DNs unresponsive. The client only 
> backs off by aborting / recovering the pipeline, which leads to failed writes 
> and unnecessary pipeline recovery.
> This jira proposes to add explicit congestion control mechanisms in the 
> writing pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177735#comment-14177735
 ] 

Hadoop QA commented on HDFS-3342:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675985/HDFS-3342.002.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8464//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8464//console

This message is automatically generated.

> SocketTimeoutException in BlockSender.sendChunks could have a better error 
> message
> --
>
> Key: HDFS-3342
> URL: https://issues.apache.org/jira/browse/HDFS-3342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Todd Lipcon
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch
>
>
> Currently, if a client connects to a DN and begins to read a block, but then 
> stops calling read() for a long period of time, the DN will log a 
> SocketTimeoutException "48 millis timeout while waiting for channel to be 
> ready for write." This is because there is no "keepalive" functionality of 
> any kind. At a minimum, we should improve this error message to be an INFO 
> level log which just says that the client likely stopped reading, so we are 
> disconnecting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7154) Fix returning value of starting reconfiguration task

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177728#comment-14177728
 ] 

Hadoop QA commented on HDFS-7154:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12672829/HDFS-7154.001.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8460//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8460//console

This message is automatically generated.

> Fix returning value of starting reconfiguration task
> 
>
> Key: HDFS-7154
> URL: https://issues.apache.org/jira/browse/HDFS-7154
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 3.0.0, 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, 
> HDFS-7154.001.patch, HDFS-7154.001.patch
>
>
> Running {{hdfs dfsadmin -reconfig ... start}} mistakenly returns {{-1}} 
> (255). It is because {{DFSAdmin#startReconfiguration()}} returns the wrong exit 
> code. It is expected to return 0 to indicate success.
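
For readers unfamiliar with the convention at issue, a small hypothetical example of 
the Hadoop {{Tool}} exit-code contract (not the DFSAdmin code itself): whatever 
{{run()}} returns becomes the process exit status, so returning {{-1}} surfaces as 
255 in the shell.
{code}
// Hypothetical tool, not DFSAdmin: run() must return 0 on success so that
// "echo $?" after the command prints 0 rather than 255.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExitCodeSketch extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    boolean started = doStartStep();   // stand-in for the real sub-command work
    return started ? 0 : 1;            // 0 signals success to the caller's shell
  }

  private boolean doStartStep() {
    return true;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ExitCodeSketch(), args));
  }
}
{code}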



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7254) Add documents for hot swap drive

2014-10-20 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu updated HDFS-7254:

Attachment: HDFS-7254.001.patch

[~cmccabe] Thanks for your reviews. I have made the changes accordingly.

Could you take another look at the patch?

> Add documents for hot swap drive
> 
>
> Key: HDFS-7254
> URL: https://issues.apache.org/jira/browse/HDFS-7254
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 2.5.1
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7254.000.patch, HDFS-7254.001.patch
>
>
> Add documents for the hot swap drive functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message

2014-10-20 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177704#comment-14177704
 ] 

Yongjun Zhang commented on HDFS-3342:
-

Hi [~andrew.wang],

Thanks a lot for the review and comments!

Good catch. Indeed, if the user sets the log level to WARN, the new message I 
added won't be seen.

The WARN message was there before I made this change, and it's intended to 
report the stack trace for all IOExceptions. The new message I added tries to say 
"Likely the client has stopped reading...". There may be other causes of 
SocketTimeoutException than the one we are dealing with here, and I was worried 
that taking out the WARN message would cause some of those cases to go 
unreported. That's why I used the word "Likely".

To address your comment, I added a similar statement to the WARN message and 
uploaded a new rev (002), so a similar message will be printed at the WARN log 
level. I wonder whether it looks good to you.

Thanks again.
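
To make the two-level logging concrete, here is a rough, hypothetical sketch of the 
pattern being described (an INFO-level hint plus the pre-existing WARN with the 
stack trace); it is not the attached patch:
{code}
// Hypothetical sketch of the logging pattern discussed above, not HDFS-3342.002.patch.
import java.net.SocketTimeoutException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class SendChunksLoggingSketch {
  private static final Log LOG = LogFactory.getLog(SendChunksLoggingSketch.class);

  /** sendChunks is a stand-in for the real transfer loop; it is not HDFS code. */
  static void sendChunksSafely(ChunkSender sender) {
    try {
      sender.sendChunks();
    } catch (SocketTimeoutException e) {
      // INFO-level hint: the most likely cause is that the client stopped reading.
      LOG.info("Likely the client has stopped reading: " + e.getMessage());
      // Keep the WARN with the full stack trace for the less likely causes.
      LOG.warn("Exception while sending chunks", e);
    } catch (Exception e) {
      LOG.warn("Exception while sending chunks", e);
    }
  }

  interface ChunkSender {
    void sendChunks() throws Exception;
  }
}
{code}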


> SocketTimeoutException in BlockSender.sendChunks could have a better error 
> message
> --
>
> Key: HDFS-3342
> URL: https://issues.apache.org/jira/browse/HDFS-3342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Todd Lipcon
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch
>
>
> Currently, if a client connects to a DN and begins to read a block, but then 
> stops calling read() for a long period of time, the DN will log a 
> SocketTimeoutException "48 millis timeout while waiting for channel to be 
> ready for write." This is because there is no "keepalive" functionality of 
> any kind. At a minimum, we should improve this error message to be an INFO 
> level log which just says that the client likely stopped reading, so we are 
> disconnecting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message

2014-10-20 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-3342:

Attachment: HDFS-3342.002.patch

> SocketTimeoutException in BlockSender.sendChunks could have a better error 
> message
> --
>
> Key: HDFS-3342
> URL: https://issues.apache.org/jira/browse/HDFS-3342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Todd Lipcon
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-3342.001.patch, HDFS-3342.002.patch
>
>
> Currently, if a client connects to a DN and begins to read a block, but then 
> stops calling read() for a long period of time, the DN will log a 
> SocketTimeoutException "48 millis timeout while waiting for channel to be 
> ready for write." This is because there is no "keepalive" functionality of 
> any kind. At a minimum, we should improve this error message to be an INFO 
> level log which just says that the client likely stopped reading, so we are 
> disconnecting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7254) Add documents for hot swap drive

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177691#comment-14177691
 ] 

Colin Patrick McCabe commented on HDFS-7254:


{code}
   DataNode supports hot swappable drives. The user can add or replace HDFS data
{code}

Should be "the Datanode"

{code}
 * The user installs the new hard drives, formats them and mounts them
appropriately. Optional.
{code}

This seems a bit confusing.  Surely formatting and mounting appropriately is 
not optional?  Maybe this should be described as "If there are new storage 
directories, the user should format them and mount them appropriately."

The rest looks good.

> Add documents for hot swap drive
> 
>
> Key: HDFS-7254
> URL: https://issues.apache.org/jira/browse/HDFS-7254
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 2.5.1
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7254.000.patch
>
>
> Add documents for the hot swap drive functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7257) Add the time of last HA state transition to NN's /jmx page

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177679#comment-14177679
 ] 

Hadoop QA commented on HDFS-7257:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675942/HDFS-7257.002.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing
  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8459//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8459//console

This message is automatically generated.

> Add the time of last HA state transition to NN's /jmx page
> --
>
> Key: HDFS-7257
> URL: https://issues.apache.org/jira/browse/HDFS-7257
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7257.001.patch, HDFS-7257.002.patch
>
>
> It would be useful to some monitoring apps to expose the last HA transition 
> time in the NN's /jmx page.
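
As background, values reach the /jmx page through JMX MBean attributes; below is a 
generic, hypothetical illustration of exposing a timestamp that way using plain JDK 
JMX (the actual patch adds the attribute to the NameNode's own beans instead):
{code}
// Generic JMX illustration only -- not the HDFS-7257 patch.
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class HaTransitionTimeSketch {
  /** The getter becomes a "LastHATransitionTime" attribute on the registered bean. */
  public interface TransitionMXBean {
    long getLastHATransitionTime();
  }

  public static void main(String[] args) throws Exception {
    final long transitionTime = System.currentTimeMillis(); // pretend a transition just happened
    TransitionMXBean bean = new TransitionMXBean() {
      @Override
      public long getLastHATransitionTime() {
        return transitionTime;
      }
    };
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    // Once registered, JMX clients (and JSON-dumping servlets like the NN's /jmx
    // endpoint) can read the attribute from this object name.
    server.registerMBean(bean, new ObjectName("sketch:type=HaTransitionTime"));
  }
}
{code}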



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7259) Unresponsive NFS mount point due to deferred COMMIT response

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177669#comment-14177669
 ] 

Hadoop QA commented on HDFS-7259:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675970/HDFS-7259.002.patch
  against trunk revision e90718f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-nfs hadoop-hdfs-project/hadoop-hdfs-nfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8463//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8463//console

This message is automatically generated.

> Unresponsive NFS mount point due to deferred COMMIT response
> -
>
> Key: HDFS-7259
> URL: https://issues.apache.org/jira/browse/HDFS-7259
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.2.0
>Reporter: Brandon Li
>Assignee: Brandon Li
> Attachments: HDFS-7259.001.patch, HDFS-7259.002.patch
>
>
> Since the gateway can't commit random writes, it caches the COMMIT requests in 
> a queue and sends back a response only when the data can be committed or the 
> stream times out (a failure in the latter case). This can cause problems in 
> two patterns:
> (1) file upload failures
> (2) the mount dir is stuck on the same client, but other NFS clients can 
> still access the NFS gateway.
> Error pattern (2) occurs because there are too many COMMIT requests pending, 
> so the NFS client, having hit its pending-requests limit, can't send any other 
> requests (e.g., for "ls") to the NFS gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7259) Unresponsive NFS mount point due to deferred COMMIT response

2014-10-20 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-7259:
-
Attachment: HDFS-7259.002.patch

Uploaded a new patch to fix the unit tests.

> Unresponsive NFS mount point due to deferred COMMIT response
> -
>
> Key: HDFS-7259
> URL: https://issues.apache.org/jira/browse/HDFS-7259
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.2.0
>Reporter: Brandon Li
>Assignee: Brandon Li
> Attachments: HDFS-7259.001.patch, HDFS-7259.002.patch
>
>
> Since the gateway can't commit random writes, it caches the COMMIT requests in 
> a queue and sends back a response only when the data can be committed or the 
> stream times out (a failure in the latter case). This can cause problems in 
> two patterns:
> (1) file upload failures
> (2) the mount dir is stuck on the same client, but other NFS clients can 
> still access the NFS gateway.
> Error pattern (2) occurs because there are too many COMMIT requests pending, 
> so the NFS client, having hit its pending-requests limit, can't send any other 
> requests (e.g., for "ls") to the NFS gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk

2014-10-20 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177618#comment-14177618
 ] 

Yongjun Zhang commented on HDFS-7235:
-

Hi [~cmccabe],

Thanks for the review!  I just uploaded rev 003 to address all the comments.

BTW, about the WATCH-OUT, I was just thinking that someone could add another 
condition in {{FsDatasetImpl#isValidBlock}} that makes the method return 
false. But that's a remote possibility and probably won't happen.

Thanks again.




> Can not decommission DN which has invalid block due to bad disk
> ---
>
> Key: HDFS-7235
> URL: https://issues.apache.org/jira/browse/HDFS-7235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, 
> HDFS-7235.003.patch
>
>
> When decommissioning a DN, the process hangs. 
> What happens is, when NN chooses a replica as a source to replicate data on 
> the to-be-decommissioned DN to other DNs, it favors choosing this DN 
> to-be-decommissioned as the source of transfer (see BlockManager.java).  
> However, because of the bad disk, the DN would detect the source block to be 
> transferred as invalidBlock with the following logic in FsDatasetImpl.java:
> {code}
> /** Does the block exist and have the given state? */
>   private boolean isValid(final ExtendedBlock b, final ReplicaState state) {
> final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), 
> b.getLocalBlock());
> return replicaInfo != null
> && replicaInfo.getState() == state
> && replicaInfo.getBlockFile().exists();
>   }
> {code}
> The reason that this method returns false (detecting invalid block) is 
> because the block file doesn't exist due to bad disk in this case. 
> The key issue we found here is, after DN detects an invalid block for the 
> above reason, it doesn't report the invalid block back to NN, thus NN doesn't 
> know that the block is corrupted, and keeps sending the data transfer request 
> to the same DN to be decommissioned, again and again. This caused an infinite 
> loop, so the decommission process hangs.
> Thanks [~qwertymaniac] for reporting the issue and initial analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-5928) show namespace and namenode ID on NN dfshealth page

2014-10-20 Thread Siqi Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177607#comment-14177607
 ] 

Siqi Li commented on HDFS-5928:
---

I have added the check for both the namespace and the namenode ID.

> show namespace and namenode ID on NN dfshealth page
> ---
>
> Key: HDFS-5928
> URL: https://issues.apache.org/jira/browse/HDFS-5928
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siqi Li
>Assignee: Siqi Li
> Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, 
> HDFS-5928.v4.patch, HDFS-5928.v1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk

2014-10-20 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7235:

Attachment: HDFS-7235.003.patch

> Can not decommission DN which has invalid block due to bad disk
> ---
>
> Key: HDFS-7235
> URL: https://issues.apache.org/jira/browse/HDFS-7235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, 
> HDFS-7235.003.patch
>
>
> When decommissioning a DN, the process hangs. 
> What happens is, when NN chooses a replica as a source to replicate data on 
> the to-be-decommissioned DN to other DNs, it favors choosing this DN 
> to-be-decommissioned as the source of transfer (see BlockManager.java).  
> However, because of the bad disk, the DN would detect the source block to be 
> transferred as invalidBlock with the following logic in FsDatasetImpl.java:
> {code}
> /** Does the block exist and have the given state? */
>   private boolean isValid(final ExtendedBlock b, final ReplicaState state) {
> final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), 
> b.getLocalBlock());
> return replicaInfo != null
> && replicaInfo.getState() == state
> && replicaInfo.getBlockFile().exists();
>   }
> {code}
> The reason that this method returns false (detecting invalid block) is 
> because the block file doesn't exist due to bad disk in this case. 
> The key issue we found here is, after DN detects an invalid block for the 
> above reason, it doesn't report the invalid block back to NN, thus NN doesn't 
> know that the block is corrupted, and keeps sending the data transfer request 
> to the same DN to be decommissioned, again and again. This caused an infinite 
> loop, so the decommission process hangs.
> Thanks [~qwertymaniac] for reporting the issue and initial analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-5928) show namespace and namenode ID on NN dfshealth page

2014-10-20 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated HDFS-5928:
--
Attachment: HDFS-5928.v4.patch

> show namespace and namenode ID on NN dfshealth page
> ---
>
> Key: HDFS-5928
> URL: https://issues.apache.org/jira/browse/HDFS-5928
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siqi Li
>Assignee: Siqi Li
> Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, 
> HDFS-5928.v4.patch, HDFS-5928.v1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177600#comment-14177600
 ] 

Hadoop QA commented on HDFS-7266:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675856/hdfs-7266.001.patch
  against trunk revision 8942741.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8458//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8458//console

This message is automatically generated.

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>Priority: Minor
>  Labels: multi-threading
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the PeerCache, even when the peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> Since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177582#comment-14177582
 ] 

Hadoop QA commented on HDFS-7221:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675918/HDFS-7221.005.patch
  against trunk revision 8942741.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8457//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8457//console

This message is automatically generated.

> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177575#comment-14177575
 ] 

Charles Lamb commented on HDFS-7221:


I verified that the three test failures are unrelated. TestDNFencing (with and 
without replication) are known failures right now. TestDecommission passes on 
my local machine with the patch applied.

> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7215) Add JvmPauseMonitor to NFS gateway

2014-10-20 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177541#comment-14177541
 ] 

Brandon Li commented on HDFS-7215:
--

Thanks, Colin. I've filed HADOOP-11214 to track the effort of adding a web UI and 
other metric information. Depending on how much we want to expose in the web UI, 
HADOOP-11214 might become an umbrella JIRA. We will see.

> Add JvmPauseMonitor to NFS gateway
> --
>
> Key: HDFS-7215
> URL: https://issues.apache.org/jira/browse/HDFS-7215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: nfs
>Affects Versions: 2.2.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Minor
> Attachments: HDFS-7215.001.patch
>
>
> Like the NN/DN, a GC log would help debug issues in the NFS gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7270) Implementing congestion control in writing pipeline

2014-10-20 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177543#comment-14177543
 ] 

Haohui Mai commented on HDFS-7270:
--

The point of this jira is to make the pipeline more stable and reduce 
unnecessary aborts / recovery. An alternative approach is to implement 
admission control -- HDFS-7265 proposes to introduce a throttler to limit the 
amount of the data that is written into HDFS.

Deriving the right configuration for the throttler to balance the stability and 
throughput of the pipeline, however, is difficult in practice. The load of the 
cluster varies from time to time, and DNs can go up and down, which can make the 
configuration suboptimal and thus defeat its purpose.
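
For contrast, a rough sketch of what the fixed-rate alternative looks like (this 
assumes the existing {{org.apache.hadoop.hdfs.util.DataTransferThrottler}} helper; 
the 50 MB/s figure is made up, and choosing it well for a cluster whose load varies 
is exactly the hard part described above):
{code}
// Sketch of the HDFS-7265-style fixed-rate approach, not this jira's proposal.
import org.apache.hadoop.hdfs.util.DataTransferThrottler;

public class ThrottlerSketch {
  public static void main(String[] args) {
    long bytesPerSec = 50L * 1024 * 1024;   // statically chosen cap (arbitrary example)
    DataTransferThrottler throttler = new DataTransferThrottler(bytesPerSec);
    byte[] buf = new byte[64 * 1024];
    for (int i = 0; i < 1000; i++) {
      // the actual write(buf) would go here; throttle() sleeps as needed so the
      // loop never exceeds the configured bandwidth
      throttler.throttle(buf.length);
    }
  }
}
{code}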

> Implementing congestion control in writing pipeline
> ---
>
> Key: HDFS-7270
> URL: https://issues.apache.org/jira/browse/HDFS-7270
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haohui Mai
>Assignee: Haohui Mai
>
> When a client writes to HDFS faster than the disk bandwidth of the DNs, it  
> saturates the disk bandwidth and renders the DNs unresponsive. The client only 
> backs off by aborting / recovering the pipeline, which leads to failed writes 
> and unnecessary pipeline recovery.
> This jira proposes to add explicit congestion control mechanisms in the 
> writing pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7270) Implementing congestion control in writing pipeline

2014-10-20 Thread Haohui Mai (JIRA)
Haohui Mai created HDFS-7270:


 Summary: Implementing congestion control in writing pipeline
 Key: HDFS-7270
 URL: https://issues.apache.org/jira/browse/HDFS-7270
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Haohui Mai
Assignee: Haohui Mai


When a client writes to HDFS faster than the disk bandwidth of the DNs, it  
saturates the disk bandwidth and renders the DNs unresponsive. The client only 
backs off by aborting / recovering the pipeline, which leads to failed writes 
and unnecessary pipeline recovery.

This jira proposes to add explicit congestion control mechanisms in the writing 
pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7269) NN and DN don't check whether corrupted blocks reported by clients are actually corrupted

2014-10-20 Thread Ming Ma (JIRA)
Ming Ma created HDFS-7269:
-

 Summary: NN and DN don't check whether corrupted blocks reported 
by clients are actually corrupted
 Key: HDFS-7269
 URL: https://issues.apache.org/jira/browse/HDFS-7269
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ming Ma


We had a case where the client machine had a memory issue and thus failed the 
checksum validation of a given block for all its replicas. So the client ended 
up informing the NN about the corrupted block on all DNs via reportBadBlocks. 
However, the block isn't corrupted on any of the DNs; you can still use 
DFSClient to read the block. But in order to get rid of the NN's warning message 
for the corrupt block, we either do an NN failover, or repair the file via a) 
copying the file somewhere, b) removing the file, c) copying the file back.

It will be useful if NN and DN can validate client's report. In fact, there is 
a comment in NamenodeRpcServer about this.

{noformat}
  /**
   * The client has detected an error on the specified located blocks 
   * and is reporting them to the server.  For now, the namenode will 
   * mark the block as corrupt.  In the future we might 
   * check the blocks are actually corrupt. 
   */
{noformat}

To allow the system to recover from an invalid client report quickly, we can 
support automatic recovery or a manual admin command.

1. We can have the NN send a new DatanodeCommand like ValidateBlockCommand. The DN 
will report the validation result via an IBR and a new 
ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK.
2. Some new admin command to move corrupted blocks out of the BM's 
CorruptReplicasMap and UnderReplicatedBlocks.

Appreciate any input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7154) Fix returning value of starting reconfiguration task

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177528#comment-14177528
 ] 

Colin Patrick McCabe commented on HDFS-7154:


+1.  I am going to re-run Jenkins to get something which looks a little nicer.

> Fix returning value of starting reconfiguration task
> 
>
> Key: HDFS-7154
> URL: https://issues.apache.org/jira/browse/HDFS-7154
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode
>Affects Versions: 3.0.0, 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7154.000.patch, HDFS-7154.001.patch, 
> HDFS-7154.001.patch, HDFS-7154.001.patch
>
>
> Running {{hdfs dfsadmin -reconfig ... start}} mistakenly returns {{-1}} 
> (255). It is because {{DFSAdmin#startReconfiguration()}} returns the wrong exit 
> code. It is expected to return 0 to indicate success.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7235) Can not decommission DN which has invalid block due to bad disk

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177518#comment-14177518
 ] 

Colin Patrick McCabe commented on HDFS-7235:


{code}
1787  ReplicaInfo replicaInfo = null;
1788  synchronized(data) {
1789replicaInfo = (ReplicaInfo) data.getReplica( 
block.getBlockPoolId(),
1790block.getBlockId());
1791  }
1792  if (replicaInfo != null 
1793  && replicaInfo.getState() == ReplicaState.FINALIZED 
1794  && !replicaInfo.getBlockFile().exists()) {
{code}
You can't release the lock this way.  Once you release the lock, replicaInfo 
could be mutated at any time, so you need to do the whole check under the lock.
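
A minimal sketch of doing the whole check under the lock (illustrative only; the 
actual fix may differ):
{code}
// Sketch: keep the lookup and the entire condition inside the lock so replicaInfo
// cannot be mutated between the lookup and the checks; only a boolean escapes.
boolean reportBadBlock;
synchronized (data) {
  ReplicaInfo replicaInfo = (ReplicaInfo) data.getReplica(
      block.getBlockPoolId(), block.getBlockId());
  reportBadBlock = replicaInfo != null
      && replicaInfo.getState() == ReplicaState.FINALIZED
      && !replicaInfo.getBlockFile().exists();
}
if (reportBadBlock) {
  // report the bad block (missing block file) back to the NN, outside the lock
}
{code}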

{code}
1795//
1796// Report back to NN bad block caused by non-existent block 
file.
1797// WATCH-OUT: be sure the conditions checked above matches the 
following
1798// method in FsDatasetImpl.java:
1799//   boolean isValidBlock(ExtendedBlock b)
1800// all other conditions need to be true except that 
1801// replicaInfo.getBlockFile().exists() returns false.
1802//
{code}
I don't think we need the "WATCH-OUT" part.  We're not calling 
{{isValidBlock}}, so why do we care if the check is the same as that check?

I generally agree with this approach and I think we can get this in if that's 
fixed.

> Can not decommission DN which has invalid block due to bad disk
> ---
>
> Key: HDFS-7235
> URL: https://issues.apache.org/jira/browse/HDFS-7235
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch
>
>
> When decommissioning a DN, the process hangs. 
> What happens is, when NN chooses a replica as a source to replicate data on 
> the to-be-decommissioned DN to other DNs, it favors choosing this DN 
> to-be-decommissioned as the source of transfer (see BlockManager.java).  
> However, because of the bad disk, the DN would detect the source block to be 
> transferred as invalidBlock with the following logic in FsDatasetImpl.java:
> {code}
> /** Does the block exist and have the given state? */
>   private boolean isValid(final ExtendedBlock b, final ReplicaState state) {
> final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(), 
> b.getLocalBlock());
> return replicaInfo != null
> && replicaInfo.getState() == state
> && replicaInfo.getBlockFile().exists();
>   }
> {code}
> The reason that this method returns false (detecting invalid block) is 
> because the block file doesn't exist due to bad disk in this case. 
> The key issue we found here is, after DN detects an invalid block for the 
> above reason, it doesn't report the invalid block back to NN, thus NN doesn't 
> know that the block is corrupted, and keeps sending the data transfer request 
> to the same DN to be decommissioned, again and again. This caused an infinite 
> loop, so the decommission process hangs.
> Thanks [~qwertymaniac] for reporting the issue and initial analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177513#comment-14177513
 ] 

Hadoop QA commented on HDFS-7221:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675887/HDFS-7221.004.patch
  against trunk revision d5084b9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
  org.apache.hadoop.hdfs.TestDecommission
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8454//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8454//console

This message is automatically generated.

> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-5928) show namespace and namenode ID on NN dfshealth page

2014-10-20 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177508#comment-14177508
 ] 

Haohui Mai commented on HDFS-5928:
--

It seems that the page might not look right on a non-HA cluster, thus it 
requires a check to disable the output for non-HA clusters.



> show namespace and namenode ID on NN dfshealth page
> ---
>
> Key: HDFS-5928
> URL: https://issues.apache.org/jira/browse/HDFS-5928
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Siqi Li
>Assignee: Siqi Li
> Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, 
> HDFS-5928.v1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI

2014-10-20 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177483#comment-14177483
 ] 

Haohui Mai commented on HDFS-6744:
--

I think it might be better to load all the information in the browser, since we 
have to load all information anyway.

We can populate the information into the DOM when it is requested -- pagination 
and sorting can be implemented in the same way.

> Improve decommissioning nodes and dead nodes access on the new NN webUI
> ---
>
> Key: HDFS-6744
> URL: https://issues.apache.org/jira/browse/HDFS-6744
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Siqi Li
> Attachments: HDFS-6744.v1.patch, deadnodespage.png, 
> decomnodespage.png, livendoespage.png
>
>
> The new NN webUI lists live nodes at the top of the page, followed by dead 
> nodes and decommissioning nodes. From the admin's point of view:
> 1. Decommissioning nodes and dead nodes are more interesting. It is better to 
> move decommissioning nodes to the top of the page, followed by dead nodes and 
> then live nodes.
> 2. To find decommissioning nodes or dead nodes, the whole page that includes 
> all nodes needs to be loaded. That could take some time for big clusters.
> The legacy web UI filters the node types dynamically. That seems to 
> work well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7257) Add the time of last HA state transition to NN's /jmx page

2014-10-20 Thread Charles Lamb (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Lamb updated HDFS-7257:
---
Attachment: HDFS-7257.002.patch

The test failures in the jenkins run were unrelated. TestBalancer passes on my 
local machine with the patch applied.

The .002 patch moves the test to a more appropriate Test...java file.


> Add the time of last HA state transition to NN's /jmx page
> --
>
> Key: HDFS-7257
> URL: https://issues.apache.org/jira/browse/HDFS-7257
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7257.001.patch, HDFS-7257.002.patch
>
>
> It would be useful to some monitoring apps to expose the last HA transition 
> time in the NN's /jmx page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7207) Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7207:
---
Description: We should consider adding a C\+\+ interface for libhdfs, 
libhdfs3, and libwebhdfs.  This interface should not impose unreasonable 
compatibility constraints on the libraries, and should be usable by many C\+\+ 
projects in order to be useful.  We may also want to avoid exceptions because 
some C\+\+ clients do not use them.  (was: There are three major disadvantages 
of exposing exceptions in the public API:

* Exposing exceptions in public APIs forces the downstream users to be compiled 
with {{-fexceptions}}, which might be infeasible in many use cases.
* It forces other bindings to properly handle all C++ exceptions, which might 
be infeasible especially when the binding is generated by tools like SWIG.
* It forces the downstream users to properly handle all C++ exceptions, which 
can be cumbersome as in certain cases it will lead to undefined behavior (e.g., 
throwing an exception in a destructor is undefined.)

)
   Priority: Major  (was: Blocker)
 Issue Type: Improvement  (was: Bug)

> Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs
> ---
>
> Key: HDFS-7207
> URL: https://issues.apache.org/jira/browse/HDFS-7207
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Colin Patrick McCabe
> Attachments: HDFS-7207.001.patch
>
>
> We should consider adding a C\+\+ interface for libhdfs, libhdfs3, and 
> libwebhdfs.  This interface should not impose unreasonable compatibility 
> constraints on the libraries, and needs to work for many C\+\+ projects in 
> order to be useful.  We may also want to avoid exceptions because some C\+\+ 
> clients do not use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7207) Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7207:
---
Issue Type: Bug  (was: Sub-task)
Parent: (was: HDFS-6994)

> Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs
> ---
>
> Key: HDFS-7207
> URL: https://issues.apache.org/jira/browse/HDFS-7207
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haohui Mai
>Assignee: Colin Patrick McCabe
>Priority: Blocker
> Attachments: HDFS-7207.001.patch
>
>
> There are three major disadvantages of exposing exceptions in the public API:
> * Exposing exceptions in public APIs forces the downstream users to be 
> compiled with {{-fexceptions}}, which might be infeasible in many use cases.
> * It forces other bindings to properly handle all C++ exceptions, which might 
> be infeasible especially when the binding is generated by tools like SWIG.
> * It forces the downstream users to properly handle all C++ exceptions, which 
> can be cumbersome as in certain cases it will lead to undefined behavior 
> (e.g., throwing an exception in a destructor is undefined.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177426#comment-14177426
 ] 

Colin Patrick McCabe commented on HDFS-7207:


bq. As you mentioned, exposing \[shared_ptr\] might force the users to run 
tools like valgrind to detect leaks. It is impractical to use valgrind in many 
real-world use cases – valgrind can easily slow the program down by 20x. See 
http://groups.csail.mit.edu/commit/papers/2011/bruening-cgo11-drmemory.pdf

I believe that using {{shared_ptr}} can reduce the frequency of memory leaks in 
many scenarios, such as this one.  Avoiding memory leaks is one reason to use 
{{shared_ptr}}, in fact.  Please do not forget that the C interface can 
generate memory leaks as well.

bq. Though I prefer having a native C\+\+ interface, for the first cut I 
think it is fine to implement it using the C interface and to declare the 
interface as unstable. On the other hand, I think we also need to clean up the 
interface a little bit to make it more usable for C++ users.

I agree.  Let's move this JIRA out of the HDFS-6994 branch and consider it 
later.  Adding a new API requires a lot of discussion and care, and should be 
done for all our interface libraries, not just for one.  We should focus 
HDFS-6994 on getting libhdfs3 into a usable state.

> libhdfs3 should not expose exceptions in public C++ API
> ---
>
> Key: HDFS-7207
> URL: https://issues.apache.org/jira/browse/HDFS-7207
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Haohui Mai
>Assignee: Colin Patrick McCabe
>Priority: Blocker
> Attachments: HDFS-7207.001.patch
>
>
> There are three major disadvantages of exposing exceptions in the public API:
> * Exposing exceptions in public APIs forces the downstream users to be 
> compiled with {{-fexceptions}}, which might be infeasible in many use cases.
> * It forces other bindings to properly handle all C++ exceptions, which might 
> be infeasible especially when the binding is generated by tools like SWIG.
> * It forces the downstream users to properly handle all C++ exceptions, which 
> can be cumbersome as in certain cases it will lead to undefined behavior 
> (e.g., throwing an exception in a destructor is undefined.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7207) Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7207:
---
Summary: Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs  
(was: libhdfs3 should not expose exceptions in public C++ API)

> Consider adding a C++ API for libhdfs, libhdfs3, and libwebhdfs
> ---
>
> Key: HDFS-7207
> URL: https://issues.apache.org/jira/browse/HDFS-7207
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Haohui Mai
>Assignee: Colin Patrick McCabe
>Priority: Blocker
> Attachments: HDFS-7207.001.patch
>
>
> There are three major disadvantages of exposing exceptions in the public API:
> * Exposing exceptions in public APIs forces the downstream users to be 
> compiled with {{-fexceptions}}, which might be infeasible in many use cases.
> * It forces other bindings to properly handle all C++ exceptions, which might 
> be infeasible especially when the binding is generated by tools like SWIG.
> * It forces the downstream users to properly handle all C++ exceptions, which 
> can be cumbersome as in certain cases it will lead to undefined behavior 
> (e.g., throwing an exception in a destructor is undefined.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177427#comment-14177427
 ] 

Yongjun Zhang commented on HDFS-7221:
-

Thanks Charles and Ming, the latest patch  LGTM too.


> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7227) Fix findbugs warning about NP_DEREFERENCE_OF_READLINE_VALUE in SpanReceiverHost

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177417#comment-14177417
 ] 

Colin Patrick McCabe commented on HDFS-7227:


bq. Tsuyoshi wrote: Hi Colin Patrick McCabe, Java coding style says that we 
should avoid omitting braces:

Right.  That's why I commented that "I thought there was some text in there 
about short "if" statements being OK to do on one line, but I don't see it in 
the guide."

bq. stack wrote: Patch LGTM +1.

Can I get another +1 on this?  Since we're being pedantic :)

It's clear that the findbugs warning in AbstractDelegationTokenSecretManager is 
not related, since this patch doesn't change that.

> Fix findbugs warning about NP_DEREFERENCE_OF_READLINE_VALUE in 
> SpanReceiverHost
> ---
>
> Key: HDFS-7227
> URL: https://issues.apache.org/jira/browse/HDFS-7227
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Attachments: HDFS-7227.001.patch, HDFS-7227.002.patch
>
>
> Fix findbugs warning about NP_DEREFERENCE_OF_READLINE_VALUE in 
> SpanReceiverHost



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7184) Allow data migration tool to run as a daemon

2014-10-20 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177416#comment-14177416
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7184:
---

Hi Benoy, let's also merge this to 2.6, where the mover script was first 
introduced?

> Allow data migration tool to run as a daemon
> 
>
> Key: HDFS-7184
> URL: https://issues.apache.org/jira/browse/HDFS-7184
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: balancer & mover, scripts
>Affects Versions: 3.0.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-7184.patch, HDFS-7184.patch
>
>
> Just like balancer, it is sometimes required to run data migration tool in a 
> daemon mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7215) Add JvmPauseMonitor to NFS gateway

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177413#comment-14177413
 ] 

Colin Patrick McCabe commented on HDFS-7215:


Looks good to me.  Are you going to add a way to retrieve the JvmMetrics from 
the NFS gateway web UI, like {{DataNodeMetrics#getJvmMetrics}}?  We could also 
file a follow-on JIRA to do that if that's more convenient.

> Add JvmPauseMonitor to NFS gateway
> --
>
> Key: HDFS-7215
> URL: https://issues.apache.org/jira/browse/HDFS-7215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: nfs
>Affects Versions: 2.2.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Minor
> Attachments: HDFS-7215.001.patch
>
>
> Like NN/DN, a GC log would help debug issues in NFS gateway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7165) Separate block metrics for files with replication count 1

2014-10-20 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177400#comment-14177400
 ] 

Andrew Wang commented on HDFS-7165:
---

Almost there, thanks for revving Zhe.

* In ClientProtocol#getStats, it mentions "total used space of the block pool", 
and I see that being set in HeartbeatManager, but AFAICT it's dropped in the PB 
layer on the server side. If it's not being used, let's remove it. If it is 
being used, it's a compat issue to insert something at an already-in-use index 
of the stats array.
* TestMissingBlocksAlert still has a whitespace-only change. Lines 79-80 were 
deleted.
* TestUnderReplicatedBlockQueues, the extends clause:

{code}
public class TestUnderReplicatedBlockQueues extends Assert {
{code}

We should not "extends Assert" in test cases. Instead, let's add static imports 
on the various Asserts being used. Let's undo the assertInLevel changes too, 
using {{fail}} as it was before was good.
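
For reference, a minimal sketch of the suggested style, with static imports 
instead of extending Assert (the test method and values are illustrative, not 
taken from the patch):

{code}
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import org.junit.Test;

// no "extends Assert"; the asserts come in via static imports
public class TestUnderReplicatedBlockQueues {

  @Test
  public void testLevelLookup() {          // illustrative test method, not from the patch
    int expectedLevel = 2;
    int actualLevel = 2;                   // placeholder values
    assertEquals(expectedLevel, actualLevel);
    if (actualLevel < 0) {
      fail("level should never be negative");   // fail() used directly, as before
    }
  }
}
{code}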

> Separate block metrics for files with replication count 1
> -
>
> Key: HDFS-7165
> URL: https://issues.apache.org/jira/browse/HDFS-7165
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Andrew Wang
>Assignee: Zhe Zhang
> Attachments: HDFS-7165-20141003-v1.patch, 
> HDFS-7165-20141009-v1.patch, HDFS-7165-20141010-v1.patch, 
> HDFS-7165-20141015-v1.patch
>
>
> We see a lot of escalations because someone has written teragen output with a 
> replication factor of 1, a DN goes down, and a bunch of missing blocks show 
> up. These are normally false positives, since teragen output is disposable, 
> and generally speaking, users should understand this is true for all repl=1 
> files.
> It'd be nice to be able to separate out these repl=1 missing blocks from 
> missing blocks with higher replication factors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7207) libhdfs3 should not expose exceptions in public C++ API

2014-10-20 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177388#comment-14177388
 ] 

Haohui Mai commented on HDFS-7207:
--

Though I prefer having a native C\+\+ interface, for the first cut I think 
it is fine to implement it using the C interface and to declare the interface 
as unstable. On the other hand, I think we also need to clean up the 
interface a little bit to make it more usable for C\+\+ users.

> libhdfs3 should not expose exceptions in public C++ API
> ---
>
> Key: HDFS-7207
> URL: https://issues.apache.org/jira/browse/HDFS-7207
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Haohui Mai
>Assignee: Colin Patrick McCabe
>Priority: Blocker
> Attachments: HDFS-7207.001.patch
>
>
> There are three major disadvantages of exposing exceptions in the public API:
> * Exposing exceptions in public APIs forces the downstream users to be 
> compiled with {{-fexceptions}}, which might be infeasible in many use cases.
> * It forces other bindings to properly handle all C++ exceptions, which might 
> be infeasible especially when the binding is generated by tools like SWIG.
> * It forces the downstream users to properly handle all C++ exceptions, which 
> can be cumbersome as in certain cases it will lead to undefined behavior 
> (e.g., throwing an exception in a destructor is undefined.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7264) The last datanode in a pipeline should send a heartbeat when there is no traffic

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177386#comment-14177386
 ] 

Hadoop QA commented on HDFS-7264:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675891/h7264_20141020.patch
  against trunk revision d5084b9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8455//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8455//console

This message is automatically generated.

> The last datanode in a pipeline should send a heartbeat when there is no 
> traffic
> 
>
> Key: HDFS-7264
> URL: https://issues.apache.org/jira/browse/HDFS-7264
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7264_20141017.patch, h7264_20141020.patch
>
>
> When the client is writing slowly, the client will send a heartbeat to signal 
> that the connection is still alive.  This case works fine.
> However, when a client is writing fast but some of the datanodes in the 
> pipeline are busy, a PacketResponder may get a timeout since no ack is sent 
> from the upstream datanode.  We suggest that the last datanode in a pipeline 
> should send a heartbeat when there is no traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7266:
---
Status: Patch Available  (was: Open)

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.6.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>  Labels: multi-threading
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7266:
---
 Priority: Minor  (was: Major)
Affects Version/s: (was: 2.6.0)
   2.7.0
   Issue Type: Improvement  (was: Bug)

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>Priority: Minor
>  Labels: multi-threading
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177385#comment-14177385
 ] 

Colin Patrick McCabe commented on HDFS-7266:


Pending jenkins, of course

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>  Labels: multi-threading
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7266) HDFS Peercache enabled check should not lock on object

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177384#comment-14177384
 ] 

Colin Patrick McCabe commented on HDFS-7266:


+1.  Thanks, Andrew and Gopal.

> HDFS Peercache enabled check should not lock on object
> --
>
> Key: HDFS-7266
> URL: https://issues.apache.org/jira/browse/HDFS-7266
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.6.0
>Reporter: Gopal V
>Assignee: Andrew Wang
>  Labels: multi-threading
> Attachments: dfs-open-10-threads.png, hdfs-7266.001.patch
>
>
> HDFS fs.Open synchronizes on the Peercache, even when peer cache is disabled.
> {code}
>  public synchronized Peer get(DatanodeID dnId, boolean isDomain) {
> if (capacity <= 0) { // disabled
>   return null;
> }
> {code}
> since capacity is final, this check could be moved outside the lock.
> !dfs-open-10-threads.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177375#comment-14177375
 ] 

Ming Ma commented on HDFS-7221:
---

Thanks, Charles. The latest patch LGTM.

> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Charles Lamb (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Lamb updated HDFS-7221:
---
Attachment: HDFS-7221.005.patch

[~mingma],

Yes, aesthetically that is better. I've changed that in the .005 version.

Thanks for the review.


> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177354#comment-14177354
 ] 

Ming Ma commented on HDFS-7221:
---

Thanks, Charles. It shouldn't change the test result either way, but it is 
better if dfs.namenode.replication.max-streams is set to 16 as well. Otherwise, 
others might wonder why dfs.namenode.replication.max-streams is set to a much 
larger value.
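
For illustration only, the setting being discussed would look roughly like the 
snippet below in the test configuration (where exactly it lives -- 
HAStressTestHarness in the .004 patch -- is per the comments above):

{code}
// sketch: raise the per-DN replication stream limit in the test setup
Configuration conf = new Configuration();
conf.setInt("dfs.namenode.replication.max-streams", 16);
{code}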

> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7184) Allow data migration tool to run as a daemon

2014-10-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177347#comment-14177347
 ] 

Hudson commented on HDFS-7184:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6292 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6292/])
HDFS-7184. Allow data migration tool to run as a daemon. (Benoy Antony) (benoy: 
rev e4d6a878541cc07fada2bd07dedc4740570a472e)
* hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs


> Allow data migration tool to run as a daemon
> 
>
> Key: HDFS-7184
> URL: https://issues.apache.org/jira/browse/HDFS-7184
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: balancer & mover, scripts
>Affects Versions: 3.0.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-7184.patch, HDFS-7184.patch
>
>
> Just like balancer, it is sometimes required to run data migration tool in a 
> daemon mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7204) balancer doesn't run as a daemon

2014-10-20 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177344#comment-14177344
 ] 

Benoy Antony commented on HDFS-7204:


+1

> balancer doesn't run as a daemon
> 
>
> Key: HDFS-7204
> URL: https://issues.apache.org/jira/browse/HDFS-7204
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: scripts
>Affects Versions: 3.0.0
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
>Priority: Blocker
>  Labels: newbie
> Attachments: HDFS-7204-01.patch, HDFS-7204.patch
>
>
> From HDFS-7184, minor issues with balancer:
> * daemon isn't set to true in hdfs to enable daemonization
> * start-balancer script has usage instead of hadoop_usage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7184) Allow data migration tool to run as a daemon

2014-10-20 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated HDFS-7184:
---
 Target Version/s: 3.0.0
Affects Version/s: 3.0.0

> Allow data migration tool to run as a daemon
> 
>
> Key: HDFS-7184
> URL: https://issues.apache.org/jira/browse/HDFS-7184
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: balancer & mover, scripts
>Affects Versions: 3.0.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-7184.patch, HDFS-7184.patch
>
>
> Just like balancer, it is sometimes required to run data migration tool in a 
> daemon mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7184) Allow data migration tool to run as a daemon

2014-10-20 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated HDFS-7184:
---
  Resolution: Fixed
   Fix Version/s: 3.0.0
Target Version/s:   (was: 2.6.0)
  Status: Resolved  (was: Patch Available)

committed to trunk.

> Allow data migration tool to run as a daemon
> 
>
> Key: HDFS-7184
> URL: https://issues.apache.org/jira/browse/HDFS-7184
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: balancer & mover, scripts
>Reporter: Benoy Antony
>Assignee: Benoy Antony
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-7184.patch, HDFS-7184.patch
>
>
> Just like balancer, it is sometimes required to run data migration tool in a 
> daemon mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7218) FSNamesystem ACL operations should write to audit log on failure

2014-10-20 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177335#comment-14177335
 ] 

Charles Lamb commented on HDFS-7218:


The two test failures are unrelated.


> FSNamesystem ACL operations should write to audit log on failure
> 
>
> Key: HDFS-7218
> URL: https://issues.apache.org/jira/browse/HDFS-7218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7218.001.patch, HDFS-7218.002.patch, 
> HDFS-7218.003.patch
>
>
> Various Acl methods in FSNamesystem do not write to the audit log when the 
> operation is not successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Charles Lamb (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177332#comment-14177332
 ] 

Charles Lamb commented on HDFS-7221:


TestDNFencing has been failing lately. TestInterDatanodeProtocol runs OK on my 
local machine with the patch applied.


> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177319#comment-14177319
 ] 

Hadoop QA commented on HDFS-7221:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675869/HDFS-7221.003.patch
  against trunk revision d5084b9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing
  
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestInterDatanodeProtocol

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8453//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8453//console

This message is automatically generated.

> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7211) Block invalidation work should be ordered

2014-10-20 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177311#comment-14177311
 ] 

Andrew Wang commented on HDFS-7211:
---

Maybe LightWeightLinkedSet?
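
For readers following along, the behavioural difference at stake is simply 
insertion-order iteration. A JDK-only illustration (LightWeightHashSet and 
LightWeightLinkedSet are the HDFS-internal analogues of these two collections):

{code}
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class OrderingDemo {
  public static void main(String[] args) {
    Set<Long> unordered = new HashSet<>();      // analogous to LightWeightHashSet
    Set<Long> ordered = new LinkedHashSet<>();  // analogous to LightWeightLinkedSet
    for (long blockId = 1; blockId <= 5; blockId++) {
      unordered.add(blockId * 1000003L);        // arbitrary block ids
      ordered.add(blockId * 1000003L);
    }
    // iteration order of the plain hash set is unspecified, so invalidation
    // commands would be issued out of order; the linked set preserves the
    // order in which blocks were queued for invalidation
    System.out.println("hash order:   " + unordered);
    System.out.println("linked order: " + ordered);
  }
}
{code}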

> Block invalidation work should be ordered
> -
>
> Key: HDFS-7211
> URL: https://issues.apache.org/jira/browse/HDFS-7211
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>
> {{InvalidateBlocks#node2blocks}} uses an unordered {{LightWeightHashSet}} to 
> store blocks (to be invalidated) on the same DN. This causes poor ordering 
> when a single DN has a large number of blocks to invalidate. Blocks should be 
> invalidated following the order of invalidation commands -- at least roughly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7225) Failed DataNode lookup can crash NameNode with NullPointerException

2014-10-20 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177307#comment-14177307
 ] 

Andrew Wang commented on HDFS-7225:
---

Nice examination here Zhe. One high-level question though, could we simplify 
the above by cleaning InvalidateBlocks immediately upon seeing the new 
datanodeUuid? If the old volume is brought back, the old blocks will be in the 
block report and the NN will re-populate InvalidateBlocks as needed when it 
processes the report.

> Failed DataNode lookup can crash NameNode with NullPointerException
> ---
>
> Key: HDFS-7225
> URL: https://issues.apache.org/jira/browse/HDFS-7225
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HDFS-7225-v1.patch
>
>
> {{BlockManager#invalidateWorkForOneNode}} looks up a DataNode by the 
> {{datanodeUuid}} and passes the resultant {{DatanodeDescriptor}} to 
> {{InvalidateBlocks#invalidateWork}}. However, if a wrong or outdated 
> {{datanodeUuid}} is used, a null pointer will be passed to {{invalidateWork}} 
> which will use it to lookup in a {{TreeMap}}. Since the key type is 
> {{DatanodeDescriptor}}, key comparison is based on the IP address. A null key 
> will crash the NameNode with an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7257) Add the time of last HA state transition to NN's /jmx page

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177289#comment-14177289
 ] 

Hadoop QA commented on HDFS-7257:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675855/HDFS-7257.001.patch
  against trunk revision d5084b9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1265 javac 
compiler warnings (more than the trunk's current 1264 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing
  org.apache.hadoop.hdfs.server.balancer.TestBalancer

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestHdfsAdmin

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8451//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8451//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8451//console

This message is automatically generated.

> Add the time of last HA state transition to NN's /jmx page
> --
>
> Key: HDFS-7257
> URL: https://issues.apache.org/jira/browse/HDFS-7257
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7257.001.patch
>
>
> It would be useful to some monitoring apps to expose the last HA transition 
> time in the NN's /jmx page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7218) FSNamesystem ACL operations should write to audit log on failure

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177290#comment-14177290
 ] 

Hadoop QA commented on HDFS-7218:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675861/HDFS-7218.003.patch
  against trunk revision d5084b9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestHdfsAdmin

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8452//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8452//console

This message is automatically generated.

> FSNamesystem ACL operations should write to audit log on failure
> 
>
> Key: HDFS-7218
> URL: https://issues.apache.org/jira/browse/HDFS-7218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7218.001.patch, HDFS-7218.002.patch, 
> HDFS-7218.003.patch
>
>
> Various Acl methods in FSNamesystem do not write to the audit log when the 
> operation is not successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7244) Reduce Namenode memory using Flyweight pattern

2014-10-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177280#comment-14177280
 ] 

Colin Patrick McCabe commented on HDFS-7244:


It's exciting to see progress on this, [~langera]!

There are a few questions we need to figure out here.  One is fallback... if 
{{ByteBuffer#allocateDirect}} is not available on the JVM, what do we do?  In my 
earlier patch, I simply used {{ByteBuffer#allocate}}.  I still like this approach, 
but it does mean we can't chase raw pointers when implementing off-heap data 
structures.  I was trying to address this by using \{ 32-bit slab ID, 32-bit 
slab offset \} tuples instead.  This does require that we do a lookup in a 
{{map}} whenever we chase a "pointer", though.

Another approach to fallback is to use raw pointers if they're available, and 
\{ slabID, offset\} tuples if they're not.  This is faster for the common case 
of true off-heaping.  The complication here is that theoretically one 
{{allocateDirect}} call could fail while another succeeds.  If we did this, we'd 
probably want to create a configuration key like {{hadoop.use.off.heap}}, and 
throw a hard failure whenever this was {{true}} but {{allocateDirect}} failed.
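
To make the fallback idea concrete, here is a rough sketch; the 
hadoop.use.off.heap flag is the hypothetical key floated above, and none of 
this is committed code:

{code}
import java.nio.ByteBuffer;

public final class SlabAllocatorSketch {
  /**
   * Sketch of the fallback policy discussed above: try a direct (off-heap)
   * allocation first, and fall back to an on-heap buffer unless the
   * hypothetical hadoop.use.off.heap flag demands off-heap memory.
   */
  public static ByteBuffer allocateSlab(int capacity, boolean requireOffHeap) {
    try {
      return ByteBuffer.allocateDirect(capacity);     // off-heap slab
    } catch (OutOfMemoryError e) {                     // direct memory exhausted
      if (requireOffHeap) {
        // hard failure when the flag is set but direct allocation fails
        throw new IllegalStateException("off-heap slab allocation failed", e);
      }
      return ByteBuffer.allocate(capacity);            // on-heap fallback
    }
  }
}
{code}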

What data structures are you planning on using to look up block data in the NN? 
 I was considering an off-heap hash map implementation.

If you look at the requirements for our BlocksMap, we need:
* fast lookup of \{ 64-bit blockID, string bpId \} to yield all DNs where this 
block is replicated
* ability to iterate over all blocks which a DN holds

#1 is not too difficult, but #2 could be tricky.  The obvious solution is just 
to have a hash map from \{ blockID, bpID \} to a node structure which is a 
member of a few implicit linked lists.  This does mean the node structure has 
variable size, which could be challenging to implement (It's basically the 
{{malloc}} problem).  There isn't any upper limit on the number of DNs a block 
can be on.

A better way might be to have a hash map from \{ blockID, bpID, replicaIndex \} 
so that we avoid implicit linked lists.  So to find the first replica for 
BlockID 123 in bpID "foo", you look up (123, foo, 0)... the second, (123, foo, 
1), and so forth.
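
As a plain on-heap analogue of that lookup scheme (the off-heap version would 
pack the same composite key into a slab, but the probing idea is identical; 
all names here are illustrative):

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class ReplicaLookupSketch {
  /** Composite key: { blockId, blockPoolId, replicaIndex }. */
  static final class ReplicaKey {
    final long blockId;
    final String bpId;
    final int replicaIndex;

    ReplicaKey(long blockId, String bpId, int replicaIndex) {
      this.blockId = blockId;
      this.bpId = bpId;
      this.replicaIndex = replicaIndex;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ReplicaKey)) {
        return false;
      }
      ReplicaKey k = (ReplicaKey) o;
      return blockId == k.blockId && replicaIndex == k.replicaIndex
          && bpId.equals(k.bpId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(blockId, bpId, replicaIndex);
    }
  }

  public static void main(String[] args) {
    Map<ReplicaKey, String> replicaToDatanode = new HashMap<>();
    replicaToDatanode.put(new ReplicaKey(123L, "foo", 0), "dn-1");
    replicaToDatanode.put(new ReplicaKey(123L, "foo", 1), "dn-2");
    // probe replica indices 0, 1, 2, ... until the first miss to enumerate locations
    for (int i = 0; ; i++) {
      String dn = replicaToDatanode.get(new ReplicaKey(123L, "foo", i));
      if (dn == null) {
        break;
      }
      System.out.println("replica " + i + " of block 123 in pool foo is on " + dn);
    }
  }
}
{code}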

This also raises a few questions.
* should we create a lookup table for bpids?  We clearly don't want to store 
the string everywhere, and we can't use Java string interning when doing 
off-heap.  A 16-bit or 32-bit lookup table from string bpid -> bpid index would 
certainly slim this down.
* similar for DNs... how do we identify them?  The storage ID is too long to be 
practical.  The simplest way would be a 64-bit ID where we didn't reuse any 
indices.  If we have 32-bit or less DN IDs we'll have to figure out some 
garbage collection strategy, which could be tricky.

Do you think we'll need a branch for this?  I don't have a feeling yet for how 
incremental it is.  Clearly adding the Slab code can be done in trunk without 
destabilizing anything else.  I'm not as clear on how difficult the other 
subtasks are going to be to do in an "incremental" way.

Do you have some code using the Slab code yet?  It might be hard to know 
exactly what API we want for Slab until we see how it works in action.  Of 
course we can always modify it later, but posting a combined patch would give 
me a better feel for it.

> Reduce Namenode memory using Flyweight pattern
> --
>
> Key: HDFS-7244
> URL: https://issues.apache.org/jira/browse/HDFS-7244
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Amir Langer
>
> Using the flyweight pattern can dramatically reduce memory usage in the 
> Namenode. The pattern also abstracts the actual storage type and allows the 
> decision of whether it is off-heap or not and what is the serialisation 
> mechanism to be configured per deployment. 
> The idea is to move all BlockInfo data (as a first step) to this storage 
> using the Flyweight pattern. The cost to doing it will be in higher latency 
> when accessing/modifying a block. The idea is that this will be offset with a 
> reduction in memory and in the case of off-heap, a dramatic reduction in 
> memory (effectively, memory used for BlockInfo would reduce to a very small 
> constant value).
> This reduction will also have an huge impact on the latency as GC pauses will 
> be reduced considerably and may even end up with better latency results than 
> the original code.
> I wrote a stand-alone project as a proof of concept, to show the pattern, the 
> data structure we can use and what will be the performance costs of this 
> approach.
> see [Slab|https://github.com/langera/slab]
> and [Slab performance 
> results|https://github.com/langera/slab/wiki/Performance-Results].
> Slab abstracts the storage, gives several s

[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI

2014-10-20 Thread Siqi Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177262#comment-14177262
 ] 

Siqi Li commented on HDFS-6744:
---

I have attached 3 screenshots, one for each page (livenodes, deadnodes, decomnodes).

> Improve decommissioning nodes and dead nodes access on the new NN webUI
> ---
>
> Key: HDFS-6744
> URL: https://issues.apache.org/jira/browse/HDFS-6744
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Siqi Li
> Attachments: HDFS-6744.v1.patch, deadnodespage.png, 
> decomnodespage.png, livendoespage.png
>
>
> The new NN webUI lists live node at the top of the page, followed by dead 
> node and decommissioning node. From admins point of view:
> 1. Decommissioning nodes and dead nodes are more interesting. It is better to 
> move decommissioning nodes to the top of the page, followed by dead nodes and 
> decommissioning nodes.
> 2. To find decommissioning nodes or dead nodes, the whole page that includes 
> all nodes needs to be loaded. That could take some time for big clusters.
> The legacy web UI filters out the type of nodes dynamically. That seems to 
> work well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI

2014-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177261#comment-14177261
 ] 

Hadoop QA commented on HDFS-6744:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675894/decomnodespage.png
  against trunk revision d5084b9.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8456//console

This message is automatically generated.

> Improve decommissioning nodes and dead nodes access on the new NN webUI
> ---
>
> Key: HDFS-6744
> URL: https://issues.apache.org/jira/browse/HDFS-6744
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Siqi Li
> Attachments: HDFS-6744.v1.patch, deadnodespage.png, 
> decomnodespage.png, livendoespage.png
>
>
> The new NN webUI lists live node at the top of the page, followed by dead 
> node and decommissioning node. From admins point of view:
> 1. Decommissioning nodes and dead nodes are more interesting. It is better to 
> move decommissioning nodes to the top of the page, followed by dead nodes and 
> decommissioning nodes.
> 2. To find decommissioning nodes or dead nodes, the whole page that includes 
> all nodes needs to be loaded. That could take some time for big clusters.
> The legacy web UI filters out the type of nodes dynamically. That seems to 
> work well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI

2014-10-20 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated HDFS-6744:
--
Attachment: decomnodespage.png
deadnodespage.png
livendoespage.png

> Improve decommissioning nodes and dead nodes access on the new NN webUI
> ---
>
> Key: HDFS-6744
> URL: https://issues.apache.org/jira/browse/HDFS-6744
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Siqi Li
> Attachments: HDFS-6744.v1.patch, deadnodespage.png, 
> decomnodespage.png, livendoespage.png
>
>
> The new NN webUI lists live node at the top of the page, followed by dead 
> node and decommissioning node. From admins point of view:
> 1. Decommissioning nodes and dead nodes are more interesting. It is better to 
> move decommissioning nodes to the top of the page, followed by dead nodes and 
> decommissioning nodes.
> 2. To find decommissioning nodes or dead nodes, the whole page that includes 
> all nodes needs to be loaded. That could take some time for big clusters.
> The legacy web UI filters out the type of nodes dynamically. That seems to 
> work well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7264) The last datanode in a pipeline should send a heartbeat when there is no traffic

2014-10-20 Thread Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated HDFS-7264:
--
Attachment: h7264_20141020.patch

h7264_20141020.patch: fixes the typos.

> The last datanode in a pipeline should send a heartbeat when there is no 
> traffic
> 
>
> Key: HDFS-7264
> URL: https://issues.apache.org/jira/browse/HDFS-7264
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7264_20141017.patch, h7264_20141020.patch
>
>
> When the client is writing slowly, the client will send a heartbeat to signal 
> that the connection is still alive.  This case works fine.
> However, when a client is writing fast but some of the datanodes in the 
> pipeline are busy, a PacketResponder may get a timeout since no ack is sent 
> from the upstream datanode.  We suggest that the last datanode in a pipeline 
> should send a heartbeat when there is no traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7211) Block invalidation work should be ordered

2014-10-20 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-7211:
--
  Component/s: namenode
 Target Version/s: 2.7.0
Affects Version/s: 2.5.1

> Block invalidation work should be ordered
> -
>
> Key: HDFS-7211
> URL: https://issues.apache.org/jira/browse/HDFS-7211
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>
> {{InvalidateBlocks#node2blocks}} uses an unordered {{LightWeightHashSet}} to 
> store blocks (to be invalidated) on the same DN. This causes poor ordering 
> when a single DN has a large number of blocks to invalidate. Blocks should be 
> invalidated following the order of invalidation commands -- at least roughly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7264) The last datanode in a pipeline should send a heartbeat when there is no traffic

2014-10-20 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177248#comment-14177248
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7264:
---

Hi Vinay, thanks for reviewing the patch.

> Why can't the heartbeat be enabled always, without a configuration flag which 
> is disabled by default?

It is for rolling upgrade.  We have to disable the feature first, upgrade, and 
then enable the feature.  Otherwise, the old software cannot handle the new 
heartbeat.

> The last datanode in a pipeline should send a heartbeat when there is no 
> traffic
> 
>
> Key: HDFS-7264
> URL: https://issues.apache.org/jira/browse/HDFS-7264
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7264_20141017.patch
>
>
> When the client is writing slowly, the client will send a heartbeat to signal 
> that the connection is still alive.  This case works fine.
> However, when a client is writing fast but some of the datanodes in the 
> pipeline are busy, a PacketResponder may get a timeout since no ack is sent 
> from the upstream datanode.  We suggest that the last datanode in a pipeline 
> should send a heartbeat when there is no traffic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI

2014-10-20 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177240#comment-14177240
 ] 

Haohui Mai commented on HDFS-6744:
--

[~l201514], can you please post a screenshot? Thanks.

> Improve decommissioning nodes and dead nodes access on the new NN webUI
> ---
>
> Key: HDFS-6744
> URL: https://issues.apache.org/jira/browse/HDFS-6744
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Siqi Li
> Attachments: HDFS-6744.v1.patch
>
>
> The new NN webUI lists live node at the top of the page, followed by dead 
> node and decommissioning node. From admins point of view:
> 1. Decommissioning nodes and dead nodes are more interesting. It is better to 
> move decommissioning nodes to the top of the page, followed by dead nodes and 
> decommissioning nodes.
> 2. To find decommissioning nodes or dead nodes, the whole page that includes 
> all nodes needs to be loaded. That could take some time for big clusters.
> The legacy web UI filters out the type of nodes dynamically. That seems to 
> work well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-10-20 Thread Charles Lamb (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Lamb updated HDFS-7221:
---
Attachment: HDFS-7221.004.patch

[~mingma],

Thanks for the review. That seems like a good idea. The .004 patch moves the 
setting to HAStressTestHarness.

We can see if the jenkins run blows anything up.


> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3342) SocketTimeoutException in BlockSender.sendChunks could have a better error message

2014-10-20 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177215#comment-14177215
 ] 

Andrew Wang commented on HDFS-3342:
---

Hi Yongjun, thanks for working on this,

Looking at the new output you posted, it looks like it quashes the ERROR log, 
but we still end up with 3 log prints for the same issue, and one is still at 
WARN. Wouldn't an ideal solution print just a single log message at INFO? Also 
note that if someone has the log level set to WARN (happens in production 
deployments), they'll see the scary stack trace but not the new log print you 
added. It'd also be nice to not have stack trace spam in this situation, since 
it's somewhat expected.

LMK what you think, thanks again.
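
To make the suggestion concrete, the handling being asked for has roughly this 
shape (a sketch only; the class, method and message are placeholders rather 
than the actual BlockSender code):

{code}
import java.net.SocketTimeoutException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class SendChunksSketch {
  private static final Log LOG = LogFactory.getLog(SendChunksSketch.class);

  void sendChunks() {
    try {
      writeChunksToClient();
    } catch (SocketTimeoutException e) {
      // expected when the client simply stopped reading: one INFO line, no stack trace
      LOG.info("Client appears to have stopped reading, disconnecting it: " + e);
    }
  }

  private void writeChunksToClient() throws SocketTimeoutException {
    // placeholder for the real socket write loop
  }
}
{code}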

> SocketTimeoutException in BlockSender.sendChunks could have a better error 
> message
> --
>
> Key: HDFS-3342
> URL: https://issues.apache.org/jira/browse/HDFS-3342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Todd Lipcon
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-3342.001.patch
>
>
> Currently, if a client connects to a DN and begins to read a block, but then 
> stops calling read() for a long period of time, the DN will log a 
> SocketTimeoutException "48 millis timeout while waiting for channel to be 
> ready for write." This is because there is no "keepalive" functionality of 
> any kind. At a minimum, we should improve this error message to be an INFO 
> level log which just says that the client likely stopped reading, so 
> disconnecting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7265) Use a throttler for replica write in datanode

2014-10-20 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177164#comment-14177164
 ] 

Haohui Mai commented on HDFS-7265:
--

I found that it is better to throttle dynamically instead of throttling on a 
pre-defined bandwidth. Other workloads in the cluster can dramatically impact 
disk utilization, so it is quite difficult to come up with a configuration that 
protects the DNs from being overloaded while still saturating the peak 
throughput.
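
For context, "throttling on a pre-defined bandwidth" means fixed-rate pacing 
along the lines sketched below (similar in spirit to the existing 
DataTransferThrottler); a dynamic throttler would instead adjust bytesPerSecond 
from observed disk or queue load rather than from configuration:

{code}
/**
 * Minimal fixed-bandwidth throttler sketch: callers report how many bytes they
 * just wrote, and the throttler sleeps long enough to keep the average rate at
 * or below bytesPerSecond.
 */
public class FixedBandwidthThrottler {
  private final long bytesPerSecond;
  private long periodStartMs = System.currentTimeMillis();
  private long bytesThisPeriod = 0;

  public FixedBandwidthThrottler(long bytesPerSecond) {
    this.bytesPerSecond = bytesPerSecond;
  }

  public synchronized void throttle(long numBytes) throws InterruptedException {
    bytesThisPeriod += numBytes;
    long elapsedMs = System.currentTimeMillis() - periodStartMs;
    // time we *should* have taken to write this many bytes at the target rate
    long expectedMs = bytesThisPeriod * 1000 / bytesPerSecond;
    if (expectedMs > elapsedMs) {
      Thread.sleep(expectedMs - elapsedMs);   // pace the writer down to the limit
    }
    if (elapsedMs > 1000) {                   // start a fresh accounting period
      periodStartMs = System.currentTimeMillis();
      bytesThisPeriod = 0;
    }
  }
}
{code}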

> Use a throttler for replica write in datanode
> -
>
> Key: HDFS-7265
> URL: https://issues.apache.org/jira/browse/HDFS-7265
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7265_20141018.patch
>
>
> BlockReceiver processes packets in BlockReceiver.receivePacket() as follows:
> # read from socket
> # enqueue the ack
> # write to downstream
> # write to disk
> The above steps are repeated for each packet in a single thread.  When there 
> are a lot of concurrent writes in a datanode, the write time in #4 becomes 
> very long.  As a result, it leads to SocketTimeoutException since it cannot 
> read from the socket for a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

