[jira] [Created] (HDFS-9054) Race condition on yieldCount in FSDirectory.java

2015-09-11 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-9054:
--

 Summary: Race condition on yieldCount in FSDirectory.java
 Key: HDFS-9054
 URL: https://issues.apache.org/jira/browse/HDFS-9054
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
Priority: Minor


getContentSummaryInt only holds the read lock, yet it calls fsd.addYieldCount, 
which may cause a race condition:
{code}
  private static ContentSummary getContentSummaryInt(FSDirectory fsd,
  INodesInPath iip) throws IOException {
fsd.readLock();
try {
  INode targetNode = iip.getLastINode();
  if (targetNode == null) {
throw new FileNotFoundException("File does not exist: " + 
iip.getPath());
  }
  else {
// Make it relinquish locks everytime contentCountLimit entries are
// processed. 0 means disabled. I.e. blocking for the entire duration.
ContentSummaryComputationContext cscc =
new ContentSummaryComputationContext(fsd, fsd.getFSNamesystem(),
fsd.getContentCountLimit(), fsd.getContentSleepMicroSec());
ContentSummary cs = targetNode.computeAndConvertContentSummary(cscc);
fsd.addYieldCount(cscc.getYieldCount());
return cs;
  }
} finally {
  fsd.readUnlock();
}
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9054) Race condition on yieldCount in FSDirectory.java

2015-09-11 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-9054:
---
Attachment: HDFS-9054.patch

This patch uses an AtomicLong to prevent the race condition.
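Roughly, the change amounts to something like the sketch below. This is illustrative only, not the actual patch; the class is a made-up stand-in for FSDirectory, which keeps the addYieldCount/getYieldCount method names.
{code}
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the idea (not the actual HDFS-9054 patch): back the yield
// counter with an AtomicLong so that callers holding only the read lock can
// update it without losing increments.
class YieldCounter {
  private final AtomicLong yieldCount = new AtomicLong(0);

  // Safe to call concurrently from threads that hold only the read lock.
  void addYieldCount(long value) {
    yieldCount.addAndGet(value);
  }

  long getYieldCount() {
    return yieldCount.get();
  }
}
{code}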

> Race condition on yieldCount in FSDirectory.java
> 
>
> Key: HDFS-9054
> URL: https://issues.apache.org/jira/browse/HDFS-9054
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-9054.patch
>
>
> getContentSummaryInt only held read lock, and it called fsd.addYieldCount 
> which may cause race condition:
> {code}
>   private static ContentSummary getContentSummaryInt(FSDirectory fsd,
>   INodesInPath iip) throws IOException {
> fsd.readLock();
> try {
>   INode targetNode = iip.getLastINode();
>   if (targetNode == null) {
> throw new FileNotFoundException("File does not exist: " + 
> iip.getPath());
>   }
>   else {
> // Make it relinquish locks everytime contentCountLimit entries are
> // processed. 0 means disabled. I.e. blocking for the entire duration.
> ContentSummaryComputationContext cscc =
> new ContentSummaryComputationContext(fsd, fsd.getFSNamesystem(),
> fsd.getContentCountLimit(), fsd.getContentSleepMicroSec());
> ContentSummary cs = targetNode.computeAndConvertContentSummary(cscc);
> fsd.addYieldCount(cscc.getYieldCount());
> return cs;
>   }
> } finally {
>   fsd.readUnlock();
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9054) Race condition on yieldCount in FSDirectory.java

2015-09-11 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-9054:
---
Affects Version/s: 2.7.0
   Status: Patch Available  (was: Open)

> Race condition on yieldCount in FSDirectory.java
> 
>
> Key: HDFS-9054
> URL: https://issues.apache.org/jira/browse/HDFS-9054
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-9054.patch
>
>
> getContentSummaryInt only held read lock, and it called fsd.addYieldCount 
> which may cause race condition:
> {code}
>   private static ContentSummary getContentSummaryInt(FSDirectory fsd,
>   INodesInPath iip) throws IOException {
> fsd.readLock();
> try {
>   INode targetNode = iip.getLastINode();
>   if (targetNode == null) {
> throw new FileNotFoundException("File does not exist: " + 
> iip.getPath());
>   }
>   else {
> // Make it relinquish locks everytime contentCountLimit entries are
> // processed. 0 means disabled. I.e. blocking for the entire duration.
> ContentSummaryComputationContext cscc =
> new ContentSummaryComputationContext(fsd, fsd.getFSNamesystem(),
> fsd.getContentCountLimit(), fsd.getContentSleepMicroSec());
> ContentSummary cs = targetNode.computeAndConvertContentSummary(cscc);
> fsd.addYieldCount(cscc.getYieldCount());
> return cs;
>   }
> } finally {
>   fsd.readUnlock();
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9054) Race condition on yieldCount in FSDirectory.java

2015-09-13 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742821#comment-14742821
 ] 

zhaoyunjiong commented on HDFS-9054:


Agree, thanks for your time.

> Race condition on yieldCount in FSDirectory.java
> 
>
> Key: HDFS-9054
> URL: https://issues.apache.org/jira/browse/HDFS-9054
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-9054.patch
>
>
> getContentSummaryInt only held read lock, and it called fsd.addYieldCount 
> which may cause race condition:
> {code}
>   private static ContentSummary getContentSummaryInt(FSDirectory fsd,
>   INodesInPath iip) throws IOException {
> fsd.readLock();
> try {
>   INode targetNode = iip.getLastINode();
>   if (targetNode == null) {
> throw new FileNotFoundException("File does not exist: " + 
> iip.getPath());
>   }
>   else {
> // Make it relinquish locks everytime contentCountLimit entries are
> // processed. 0 means disabled. I.e. blocking for the entire duration.
> ContentSummaryComputationContext cscc =
> new ContentSummaryComputationContext(fsd, fsd.getFSNamesystem(),
> fsd.getContentCountLimit(), fsd.getContentSleepMicroSec());
> ContentSummary cs = targetNode.computeAndConvertContentSummary(cscc);
> fsd.addYieldCount(cscc.getYieldCount());
> return cs;
>   }
> } finally {
>   fsd.readUnlock();
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9054) Race condition on yieldCount in FSDirectory.java

2015-09-13 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-9054:
---
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

> Race condition on yieldCount in FSDirectory.java
> 
>
> Key: HDFS-9054
> URL: https://issues.apache.org/jira/browse/HDFS-9054
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-9054.patch
>
>
> getContentSummaryInt only held read lock, and it called fsd.addYieldCount 
> which may cause race condition:
> {code}
>   private static ContentSummary getContentSummaryInt(FSDirectory fsd,
>   INodesInPath iip) throws IOException {
> fsd.readLock();
> try {
>   INode targetNode = iip.getLastINode();
>   if (targetNode == null) {
> throw new FileNotFoundException("File does not exist: " + 
> iip.getPath());
>   }
>   else {
> // Make it relinquish locks everytime contentCountLimit entries are
> // processed. 0 means disabled. I.e. blocking for the entire duration.
> ContentSummaryComputationContext cscc =
> new ContentSummaryComputationContext(fsd, fsd.getFSNamesystem(),
> fsd.getContentCountLimit(), fsd.getContentSleepMicroSec());
> ContentSummary cs = targetNode.computeAndConvertContentSummary(cscc);
> fsd.addYieldCount(cscc.getYieldCount());
> return cs;
>   }
> } finally {
>   fsd.readUnlock();
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-5367:
--

 Summary: Restore fsimage locked NameNode too long when the size of 
fsimage are big
 Key: HDFS-5367
 URL: https://issues.apache.org/jira/browse/HDFS-5367
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


Our cluster has a 40 GB fsimage, and we write one copy of the edit log to NFS.
After a temporary NFS failure, the NameNode tries to recover the storage 
directory during the next checkpoint and saves the 40 GB fsimage to NFS. That 
takes a while (> 40 GB / 128 MB/s = 320 seconds), and because it holds the 
FSNamesystem lock, it brought down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5367:
---

Attachment: (was: HDFS-5367)

> Restore fsimage locked NameNode too long when the size of fsimage are big
> -
>
> Key: HDFS-5367
> URL: https://issues.apache.org/jira/browse/HDFS-5367
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>
> Our cluster have 40G fsimage, we write one copy of edit log to NFS.
> After NFS temporary failed, when doing checkpoint, NameNode try to recover 
> it, and it will save 40G fsimage to NFS, it takes some time (> 40G/128MB/s = 
> 320 seconds) , and it locked FSNamesystem, and this bring down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5367:
---

Attachment: HDFS-5367

The fsimage restored when the SecondaryNameNode calls rollEditLog will be 
replaced soon afterwards, when the SecondaryNameNode calls rollFsImage.
So I think restoring the fsimage there is not necessary.

> Restore fsimage locked NameNode too long when the size of fsimage are big
> -
>
> Key: HDFS-5367
> URL: https://issues.apache.org/jira/browse/HDFS-5367
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>
> Our cluster have 40G fsimage, we write one copy of edit log to NFS.
> After NFS temporary failed, when doing checkpoint, NameNode try to recover 
> it, and it will save 40G fsimage to NFS, it takes some time (> 40G/128MB/s = 
> 320 seconds) , and it locked FSNamesystem, and this bring down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5367:
---

Attachment: HDFS-5367-branch-1.2.patch

This patch avoids restoring the fsimage so that rollEditLog finishes as soon as 
possible.
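The general shape of the change is sketched below; the class, field, and method names are hypothetical, not the actual branch-1 FSImage code.
{code}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the approach (made-up names, not the real
// branch-1 code): when a previously failed image directory becomes writable
// again during rollEditLog, put it back into service but skip the expensive
// fsimage write, because rollFsImage will replace that image shortly anyway.
class StorageRestorer {
  private final List<File> activeDirs = new ArrayList<File>();
  private final List<File> removedDirs = new ArrayList<File>();

  void attemptRestore(boolean saveImageNow) {
    for (File dir : new ArrayList<File>(removedDirs)) {
      if (dir.canWrite()) {
        removedDirs.remove(dir);
        activeDirs.add(dir);   // the directory is used again by the next checkpoint
        if (saveImageNow) {
          saveImageTo(dir);    // roughly 320 s for a 40 GB image over NFS; skipped during rollEditLog
        }
      }
    }
  }

  private void saveImageTo(File dir) {
    // placeholder for writing the current fsimage into 'dir'
  }
}
{code}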

> Restore fsimage locked NameNode too long when the size of fsimage are big
> -
>
> Key: HDFS-5367
> URL: https://issues.apache.org/jira/browse/HDFS-5367
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5367-branch-1.2.patch
>
>
> Our cluster have 40G fsimage, we write one copy of edit log to NFS.
> After NFS temporary failed, when doing checkpoint, NameNode try to recover 
> it, and it will save 40G fsimage to NFS, it takes some time (> 40G/128MB/s = 
> 320 seconds) , and it locked FSNamesystem, and this bring down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5367) Restoring namenode storage locks namenode due to unnecessary fsimage write

2013-10-17 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798610#comment-13798610
 ] 

zhaoyunjiong commented on HDFS-5367:


Thank you for your review.

> Restoring namenode storage locks namenode due to unnecessary fsimage write
> --
>
> Key: HDFS-5367
> URL: https://issues.apache.org/jira/browse/HDFS-5367
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 1.2.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 1.3.0
>
> Attachments: HDFS-5367-branch-1.2.patch
>
>
> Our cluster have 40G fsimage, we write one copy of edit log to NFS.
> After NFS temporary failed, when doing checkpoint, NameNode try to recover 
> it, and it will save 40G fsimage to NFS, it takes some time (> 40G/128MB/s = 
> 320 seconds) , and it locked FSNamesystem, and this bring down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists

2013-10-21 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-5396:
--

 Summary: FSImage.getFsImageName should check whether fsimage exists
 Key: HDFS-5396
 URL: https://issues.apache.org/jira/browse/HDFS-5396
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 1.2.1
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
 Fix For: 1.3.0


With the change in https://issues.apache.org/jira/browse/HDFS-5367, the fsimage 
may not be written to every IMAGE directory, so we need to check whether the 
fsimage exists before FSImage.getFsImageName returns.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists

2013-10-21 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5396:
---

Attachment: HDFS-5396-branch-1.2.patch

Check whether the fsimage exists before returning.
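The check itself is simple; a self-contained sketch follows. This is not the actual patch, and the "current/fsimage" layout is assumed here for illustration only.
{code}
import java.io.File;
import java.util.List;

// Illustrative sketch (not the actual branch-1 patch): return the image file
// of the first IMAGE directory that actually contains an fsimage, instead of
// returning the first directory's image file unconditionally.
class ImageFinder {
  static File findExistingImage(List<File> imageDirs) {
    for (File dir : imageDirs) {
      File image = new File(dir, "current/fsimage");  // layout assumed for illustration
      if (image.exists()) {
        return image;
      }
    }
    return null;  // no directory currently holds an fsimage
  }
}
{code}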

> FSImage.getFsImageName should check whether fsimage exists
> --
>
> Key: HDFS-5396
> URL: https://issues.apache.org/jira/browse/HDFS-5396
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 1.3.0
>
> Attachments: HDFS-5396-branch-1.2.patch
>
>
> In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to 
> all IMAGE dir, so we need to check whether fsimage exists before 
> FSImage.getFsImageName returned.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists

2013-10-22 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong resolved HDFS-5396.


Resolution: Not A Problem

> FSImage.getFsImageName should check whether fsimage exists
> --
>
> Key: HDFS-5396
> URL: https://issues.apache.org/jira/browse/HDFS-5396
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 1.3.0
>
> Attachments: HDFS-5396-branch-1.2.patch
>
>
> In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to 
> all IMAGE dir, so we need to check whether fsimage exists before 
> FSImage.getFsImageName returned.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists

2013-10-22 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802545#comment-13802545
 ] 

zhaoyunjiong commented on HDFS-5396:


The first image storage directory always has an fsimage file in it.
Restored image storage directories are always appended to the end of the list.
So the first one must contain an fsimage.

> FSImage.getFsImageName should check whether fsimage exists
> --
>
> Key: HDFS-5396
> URL: https://issues.apache.org/jira/browse/HDFS-5396
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 1.3.0
>
> Attachments: HDFS-5396-branch-1.2.patch
>
>
> In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to 
> all IMAGE dir, so we need to check whether fsimage exists before 
> FSImage.getFsImageName returned.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-11-27 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-5579:
--

 Summary: Under construction files make DataNode decommission take 
very long hours
 Key: HDFS-5579
 URL: https://issues.apache.org/jira/browse/HDFS-5579
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.2.0, 1.2.0
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


We noticed that decommissioning DataNodes sometimes takes a very long time, even 
exceeding 100 hours.
After checking the code, I found that 
BlockManager:computeReplicationWorkForBlocks(List<List<Block>> blocksToReplicate) 
won't replicate blocks that belong to under-construction files, whereas 
BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode) keeps the 
decommission running as long as any block still needs replication, regardless of 
whether it belongs to an under-construction file.
That is why decommissioning sometimes takes a very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-11-27 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: HDFS-5579.patch
HDFS-5579-branch-1.2.patch

This patch lets the NameNode replicate blocks that belong to under-construction 
files, except for the last block.
And if a decommissioning DataNode only has blocks that are the last blocks of 
under-construction files, each with more than one live replica left behind, then 
the NameNode can set it to DECOMMISSIONED.
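In simplified form, the decision boils down to the following sketch; it is not the actual BlockManager change, and the class and method names are made up.
{code}
// Simplified sketch of the rule described above (not the actual BlockManager
// change): the last block of an under-construction file stops holding up
// decommissioning once it has more than minReplication live replicas; every
// other under-replicated block is still counted as pending replication work.
class DecommissionCheck {
  static boolean countsAsPendingReplication(boolean fileUnderConstruction,
                                            boolean isLastBlock,
                                            int liveReplicas,
                                            int minReplication) {
    if (fileUnderConstruction && isLastBlock && liveReplicas > minReplication) {
      return false;  // safe to skip; the DataNode can still reach DECOMMISSIONED
    }
    return true;     // must be replicated before decommission completes
  }
}
{code}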

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-11-28 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: HDFS-5579-branch-1.2.patch
HDFS-5579.patch

Thanks, Vinay.
Updated the patch per your comments.
Except: getLastBlock does declare throws IOException; I deleted it in this patch.

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, 
> HDFS-5579.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-11-28 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: (was: HDFS-5579-branch-1.2.patch)

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-11-28 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: (was: HDFS-5579.patch)

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-12-05 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: HDFS-5579.patch
HDFS-5579-branch-1.2.patch

Updated the patch and added a test case for trunk.

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, 
> HDFS-5579.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-12-05 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: (was: HDFS-5579.patch)

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2013-12-05 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: (was: HDFS-5579-branch-1.2.patch)

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2014-01-08 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865202#comment-13865202
 ] 

zhaoyunjiong commented on HDFS-5579:


It's already in the patch.
+if (bc.isUnderConstruction()) {
+  if (block.equals(bc.getLastBlock()) && curReplicas > minReplication) {
+    continue;
+  }
+  underReplicatedInOpenFiles++;
+}

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2014-01-08 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: HDFS-5579-branch-1.2.patch
HDFS-5579.patch

Good point. Thanks, Jing.
Updated the patches to fix this problem.

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, 
> HDFS-5579.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2014-01-08 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: (was: HDFS-5579-branch-1.2.patch)

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2014-01-08 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: (was: HDFS-5579.patch)

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2014-01-08 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: HDFS-5579.patch

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2014-01-08 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5579:
---

Attachment: (was: HDFS-5579.patch)

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours

2014-01-13 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870385#comment-13870385
 ] 

zhaoyunjiong commented on HDFS-5579:


Thanks for taking the time to review the patch, Jing.

> Under construction files make DataNode decommission take very long hours
> 
>
> Key: HDFS-5579
> URL: https://issues.apache.org/jira/browse/HDFS-5579
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 2.4.0
>
> Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch
>
>
> We noticed that some times decommission DataNodes takes very long time, even 
> exceeds 100 hours.
> After check the code, I found that in 
> BlockManager:computeReplicationWorkForBlocks(List> 
> blocksToReplicate) it won't replicate blocks which belongs to under 
> construction files, however in 
> BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there  
> is block need replicate no matter whether it belongs to under construction or 
> not, the decommission progress will continue running.
> That's the reason some time the decommission takes very long time.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-2139) Fast copy for HDFS.

2014-06-12 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-2139:
---

Attachment: HDFS-2139.patch

Thanks, Guo Ruijing and Daryn Sharp, for your time.
Updated the patch according to the comments:
1. Added clone in DistributedFileSystem.
2. Added a check of block tokens.
3. Support cloning part of a file: the last block still uses a hardlink, and then 
truncateBlock adjusts the block size and the meta file.

Yes, the DN enforces that UC blocks are not hardlinked.


> Fast copy for HDFS.
> ---
>
> Key: HDFS-2139
> URL: https://issues.apache.org/jira/browse/HDFS-2139
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pritam Damania
> Attachments: HDFS-2139.patch, HDFS-2139.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as follows:
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing top of
> the rack data transfers.
> Note: an extra improvement would be to instruct the datanode to create a 
> hardlink of the block file if we are copying a block on the same datanode.
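
A rough outline of that flow in code, using hypothetical stand-in interfaces and names rather than the real HDFS APIs:
{code}
import java.util.List;

// Illustrative outline of the fast-copy steps listed above. The interface is a
// hypothetical stand-in, not a real HDFS API.
interface FastCopyCluster {
  List<String> getBlockIds(String srcPath);                        // 1) source block metadata
  List<String> getBlockLocations(String blockId);                  // 2) datanodes holding each replica
  String addEmptyBlock(String dstPath);                            // 3) allocate a destination block
  void copyBlockLocally(String datanode, String src, String dst);  // 4) local copy (or hardlink) on the DN
  boolean allBlocksReported(String dstPath);                       // 5) DNs report completion to the NN
}

class FastCopySketch {
  static void copy(FastCopyCluster cluster, String src, String dst) throws InterruptedException {
    for (String blockId : cluster.getBlockIds(src)) {
      String dstBlock = cluster.addEmptyBlock(dst);
      for (String datanode : cluster.getBlockLocations(blockId)) {
        cluster.copyBlockLocally(datanode, blockId, dstBlock);
      }
    }
    while (!cluster.allBlocksReported(dst)) {   // 6) wait until every block is copied
      Thread.sleep(100);
    }
  }
}
{code}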



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-01 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-6616:
--

 Summary: bestNode shouldn't always return the first DataNode
 Key: HDFS-6616
 URL: https://issues.apache.org/jira/browse/HDFS-6616
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
Priority: Minor


When we were running distcp between clusters, the job failed:
2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
part-r-00101.avro : java.net.NoRouteToHostException: No route to host
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
at java.security.AccessController.doPrivileged(Native Method)
at 
sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
at 
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

The root cause is that one of the DataNodes cannot be reached from outside the 
cluster, although it is healthy inside the cluster.
In NamenodeWebHdfsMethods.java:bestNode, it always returns the first DataNode, 
so even after the distcp retries, the job still failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-01 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6616:
---

Attachment: HDFS-6616.patch

One possible solution is to choose a DataNode randomly, at the cost of ignoring 
the network distance.
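A minimal sketch of that idea (illustrative only, with made-up class and parameter names, not the actual NamenodeWebHdfsMethods change):
{code}
import java.util.Random;

// Illustrative sketch: pick a random DataNode from the located replicas instead
// of always returning the first one, so a retry has a chance of landing on a
// node that is reachable from outside the cluster.
class RandomNodeChooser {
  private static final Random RANDOM = new Random();

  static String bestNode(String[] datanodes) {
    if (datanodes == null || datanodes.length == 0) {
      return null;  // no replica locations known
    }
    return datanodes[RANDOM.nextInt(datanodes.length)];
  }
}
{code}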

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.patch
>
>
> When we are doing distcp between clusters, job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root reason is one of the DataNode can't access from outside, but inside 
> cluster, it's health.
> In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, 
> so even after the distcp retries, it still failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-01 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6616:
---

Attachment: HDFS-6616.patch

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.patch
>
>
> When we are doing distcp between clusters, job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root reason is one of the DataNode can't access from outside, but inside 
> cluster, it's health.
> In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, 
> so even after the distcp retries, it still failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-01 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6616:
---

Attachment: (was: HDFS-6616.patch)

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.patch
>
>
> When we are doing distcp between clusters, job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root reason is one of the DataNode can't access from outside, but inside 
> cluster, it's health.
> In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, 
> so even after the distcp retries, it still failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-01 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049669#comment-14049669
 ] 

zhaoyunjiong commented on HDFS-6616:


What happened on our cluster is a very rare case.
The server runs HDP 2.1 and the client runs HDP 1.3, which is why I came up with 
this patch.

Correct me if I'm wrong: when using WebHDFS, I think it is very rare for the 
client and the data to be on the same host.
But I agree with you that supporting exclude nodes in WebHDFS is a better idea.

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.patch
>
>
> When we are doing distcp between clusters, job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root reason is one of the DataNode can't access from outside, but inside 
> cluster, it's health.
> In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, 
> so even after the distcp retries, it still failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2014-07-02 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---

Attachment: HDFS-6133-2.patch

Thanks, Daryn Sharp, for your time.
Updated the patch to use boolean instead of Boolean.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133.patch
>
>
> Currently, running the Balancer will destroy the RegionServer's data locality.
> If getBlocks could exclude blocks that belong to files under a specific path 
> prefix, like "/hbase", then we could run the Balancer without destroying the 
> RegionServer's data locality.
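
The idea above, as a small illustrative sketch (not the actual getBlocks change; class and method names are made up):
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: drop blocks whose owning file sits under an
// excluded path prefix (for example "/hbase") before handing them to the
// Balancer, so those blocks are never moved.
class BlockPathFilter {
  static List<String> excludeByPrefix(List<String> blockFilePaths, String excludedPrefix) {
    List<String> kept = new ArrayList<String>();
    for (String path : blockFilePaths) {
      if (!path.startsWith(excludedPrefix)) {
        kept.add(path);  // keep blocks outside the excluded subtree
      }
    }
    return kept;
  }
}
{code}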



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-02 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051076#comment-14051076
 ] 

zhaoyunjiong commented on HDFS-6616:


Yes, you are right.
I had not considered that a user might use WebHDFS as both the source and the 
target filesystem and run the distcp job on the source cluster.
For our use case, we always run the jobs on the target cluster and use WebHDFS 
as the source filesystem.

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.patch
>
>
> When we are doing distcp between clusters, job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root cause is that one of the DataNodes can't be reached from outside, but 
> inside the cluster it's healthy.
> NamenodeWebHdfsMethods.java:bestNode always returns the first DataNode, 
> so even after distcp retries, the job still fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-16 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6616:
---

Attachment: HDFS-6616.1.patch

Updated the patch to support excluding DataNodes in WebHDFS.

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.1.patch, HDFS-6616.patch
>
>
> While we were running distcp between clusters, the job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root cause is that one of the DataNodes can't be reached from outside, but 
> inside the cluster it's healthy.
> NamenodeWebHdfsMethods.java:bestNode always returns the first DataNode, 
> so even after distcp retries, the job still fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-17 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6616:
---

Attachment: HDFS-6616.2.patch

Thanks, Tsz Wo Nicholas Sze and Jing Zhao.

Updated the patch according to the comments: changed ExcludeDatanodesParam.NAME to 
"excludedatanodes" and changed WebHdfsFileSystem to use the exclude-datanode 
feature.

The test failures are not related.
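
As a hedged usage example (host names, port, and the exact value format are illustrative assumptions), the new parameter would be appended to a WebHDFS request roughly like this:
{code}
// Illustrative only: build a WebHDFS OPEN URL that asks the NameNode to skip
// two DataNodes when redirecting the read. Host names here are made up.
import java.net.URL;

public class WebHdfsExcludeExample {
  public static void main(String[] args) throws Exception {
    String path = "/data/part-r-00101.avro";
    String exclude = "dn1.example.com:50010,dn2.example.com:50010";
    URL openUrl = new URL("http://namenode.example.com:50070/webhdfs/v1"
        + path + "?op=OPEN&excludedatanodes=" + exclude);
    System.out.println(openUrl);
  }
}
{code}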

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.1.patch, HDFS-6616.2.patch, HDFS-6616.patch
>
>
> While we were running distcp between clusters, the job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root cause is that one of the DataNodes can't be reached from outside, but 
> inside the cluster it's healthy.
> NamenodeWebHdfsMethods.java:bestNode always returns the first DataNode, 
> so even after distcp retries, the job still fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode

2014-07-18 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6616:
---

Attachment: HDFS-6616.3.patch

Updated the patch according to the comments and fixed the test failures.

> bestNode shouldn't always return the first DataNode
> ---
>
> Key: HDFS-6616
> URL: https://issues.apache.org/jira/browse/HDFS-6616
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6616.1.patch, HDFS-6616.2.patch, HDFS-6616.3.patch, 
> HDFS-6616.patch
>
>
> While we were running distcp between clusters, the job failed:
> 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL 
> part-r-00101.avro : java.net.NoRouteToHostException: No route to host
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
>   at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
>   at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> The root cause is that one of the DataNodes can't be reached from outside, but 
> inside the cluster it's healthy.
> NamenodeWebHdfsMethods.java:bestNode always returns the first DataNode, 
> so even after distcp retries, the job still fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster

2014-08-06 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-6829:
--

 Summary: DFSAdmin refreshSuperUserGroupsConfiguration failed in 
security cluster
 Key: HDFS-6829
 URL: https://issues.apache.org/jira/browse/HDFS-6829
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 2.4.1
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
Priority: Minor


When we ran the command "hadoop dfsadmin -refreshSuperUserGroupsConfiguration", it 
failed and reported the message below:
14/08/05 21:32:06 WARN security.MultiRealmUserAuthentication: The 
serverPrincipal = doesn't confirm to the standards
refreshSuperUserGroupsConfiguration: null

After checking the code, I found the bug is triggered for the following reasons:
1. We didn't set CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, 
which is needed by RefreshUserMappingsProtocol. In DFSAdmin, if 
CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY is not set, it will 
try to use DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY: 
conf.set(CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY,   
conf.get(DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, ""));
2. But we set DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY in hdfs-site.xml.
3. DFSAdmin didn't load hdfs-site.xml.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster

2014-08-06 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6829:
---

Attachment: HDFS-6829.patch

The patch is very simple: it uses HdfsConfiguration to load hdfs-site.xml when 
constructing DFSAdmin.
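
A minimal sketch of that pattern, assuming DFSAdmin's Configuration-taking constructor; see the attached patch for the actual change:
{code}
// Sketch of the idea: construct the tool with an HdfsConfiguration so that
// hdfs-site.xml (and therefore dfs.namenode.kerberos.principal) is loaded.
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;

public class DFSAdminExample {
  public static void main(String[] args) throws Exception {
    // HdfsConfiguration adds hdfs-default.xml and hdfs-site.xml as resources,
    // unlike a plain Configuration, which only loads the core files.
    HdfsConfiguration conf = new HdfsConfiguration();
    DFSAdmin admin = new DFSAdmin(conf);
    System.exit(admin.run(args));
  }
}
{code}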

> DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster
> ---
>
> Key: HDFS-6829
> URL: https://issues.apache.org/jira/browse/HDFS-6829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 2.4.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>Priority: Minor
> Attachments: HDFS-6829.patch
>
>
> When we ran the command "hadoop dfsadmin -refreshSuperUserGroupsConfiguration", 
> it failed and reported the message below:
> 14/08/05 21:32:06 WARN security.MultiRealmUserAuthentication: The 
> serverPrincipal = doesn't confirm to the standards
> refreshSuperUserGroupsConfiguration: null
> After checking the code, I found the bug is triggered for the following reasons:
> 1. We didn't set 
> CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, which is needed 
> by RefreshUserMappingsProtocol. In DFSAdmin, if 
> CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY is not set, it will 
> try to use DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY: 
> conf.set(CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY,   
> conf.get(DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, ""));
> 2. But we set DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY in 
> hdfs-site.xml.
> 3. DFSAdmin didn't load hdfs-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy

2014-09-11 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-7044:
--

 Summary: Support retention policy based on access time and modify 
time, use XAttr to store policy
 Key: HDFS-7044
 URL: https://issues.apache.org/jira/browse/HDFS-7044
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: namenode
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


The basic idea is to set a retention policy on a directory based on access time and 
modify time, and to use an XAttr to store the policy.
Files under a directory that has a retention policy will be deleted if they meet the 
retention rule.
There are three rules (see the sketch after this list):
# access time 
#* If (accessTime + retentionTimeForAccess < now), the file will be deleted
# modify time
#* If (modifyTime + retentionTimeForModify < now), the file will be deleted
# access time and modify time
#* If (accessTime + retentionTimeForAccess < now && modifyTime + 
retentionTimeForModify < now), the file will be deleted
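
A minimal sketch of the three rules as a predicate; the field names mirror the description above, while the class itself is hypothetical:
{code}
// Hypothetical predicate implementing the three retention rules above.
// Times are in milliseconds since the epoch; how a disabled rule is
// represented is not specified here and is left out of the sketch.
enum RetentionRule { ACCESS_TIME, MODIFY_TIME, ACCESS_AND_MODIFY_TIME }

class RetentionPolicy {
  final RetentionRule rule;
  final long retentionTimeForAccess;
  final long retentionTimeForModify;

  RetentionPolicy(RetentionRule rule, long forAccess, long forModify) {
    this.rule = rule;
    this.retentionTimeForAccess = forAccess;
    this.retentionTimeForModify = forModify;
  }

  boolean shouldDelete(long accessTime, long modifyTime, long now) {
    boolean accessExpired = accessTime + retentionTimeForAccess < now;
    boolean modifyExpired = modifyTime + retentionTimeForModify < now;
    switch (rule) {
      case ACCESS_TIME:            return accessExpired;
      case MODIFY_TIME:            return modifyExpired;
      case ACCESS_AND_MODIFY_TIME: return accessExpired && modifyExpired;
      default:                     return false;
    }
  }
}
{code}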



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy

2014-09-11 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7044:
---
Attachment: Retention policy design.pdf

Attached a simple design document.
The major differences between HDFS-7044 and HDFS-6382 are (please correct me if 
I'm wrong; I only just learned that HDFS-6382 is trying to solve the same problem):
# HDFS-6382 is a standalone daemon outside the NameNode, while HDFS-7044 will run 
inside the NameNode; I believe HDFS-7044 will be simpler and more efficient.
# HDFS-7044 allows the user to set a policy based on access time or modify time, 
while HDFS-6382 only supports a single TTL.

> Support retention policy based on access time and modify time, use XAttr to 
> store policy
> 
>
> Key: HDFS-7044
> URL: https://issues.apache.org/jira/browse/HDFS-7044
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: Retention policy design.pdf
>
>
> The basic idea is to set a retention policy on a directory based on access time and 
> modify time, and to use an XAttr to store the policy.
> Files under a directory that has a retention policy will be deleted if they meet the 
> retention rule.
> There are three rules:
> # access time 
> #* If (accessTime + retentionTimeForAccess < now), the file will be deleted
> # modify time
> #* If (modifyTime + retentionTimeForModify < now), the file will be deleted
> # access time and modify time
> #* If (accessTime + retentionTimeForAccess < now && modifyTime + 
> retentionTimeForModify < now), the file will be deleted



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy

2014-09-17 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong resolved HDFS-7044.

Resolution: Duplicate

Thanks, Allen Wittenauer and Zesheng Wu.
After reading the comments in HDFS-6382, I now understand the concerns.

> Support retention policy based on access time and modify time, use XAttr to 
> store policy
> 
>
> Key: HDFS-7044
> URL: https://issues.apache.org/jira/browse/HDFS-7044
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: Retention policy design.pdf
>
>
> The basic idea is to set a retention policy on a directory based on access time and 
> modify time, and to use an XAttr to store the policy.
> Files under a directory that has a retention policy will be deleted if they meet the 
> retention rule.
> There are three rules:
> # access time 
> #* If (accessTime + retentionTimeForAccess < now), the file will be deleted
> # modify time
> #* If (modifyTime + retentionTimeForModify < now), the file will be deleted
> # access time and modify time
> #* If (accessTime + retentionTimeForAccess < now && modifyTime + 
> retentionTimeForModify < now), the file will be deleted



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2014-10-09 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-3.patch

Updated the patch to merge with trunk.
{quote}
Why we always pass false in below?
1653  new Sender(out).writeBlock(b, accessToken, clientname, targets,
1654      srcNode, stage, 0, 0, 0, 0, blockSender.getChecksum(),
1655      cachingStrategy, false);
{quote}
This code path happens when the NameNode asks a DataNode to send a block to another 
DataNode (DatanodeProtocol.DNA_TRANSFER). It is not triggered by a client, so there 
is no need to pin the block in this case.

{quote}
We will never copy a block?
925  if (datanode.data.getPinning(block))
926    String msg = "Not able to copy block " + block.getBlockId() + " "
927        + "to " + peer.getRemoteAddressString() + " because it's pinned ";
928    LOG.info(msg);
929    sendResponse(ERROR, msg);
Anything to help ensure the replica count does not rot when this pinning is 
enabled?
{quote}
When the block is under-replicated, the NameNode will send a 
DatanodeProtocol.DNA_TRANSFER command to the DataNode; this is handled by 
DataTransfer, so pinning won't affect it.
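
To make the two code paths concrete, a hedged sketch of where the pinning flag would come from in each case (method names are simplified illustrations, not the patch):
{code}
// Illustrative only: the pinning flag is honoured for client-initiated
// writes but hard-coded to false for NameNode-initiated DNA_TRANSFER copies.
class PinningDecision {
  // Client write pipeline: pin when the client supplied favored nodes.
  static boolean forClientWrite(String[] favoredNodes) {
    return favoredNodes != null && favoredNodes.length > 0;
  }

  // DataNode-to-DataNode transfer ordered by the NameNode (DNA_TRANSFER):
  // not triggered by a client, so the block is never pinned here.
  static boolean forDnaTransfer() {
    return false;
  }
}
{code}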



> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133.patch
>
>
> Currently, running the Balancer destroys the RegionServer's data locality.
> If getBlocks could exclude blocks belonging to files with a specific path 
> prefix, like "/hbase", then we could run the Balancer without destroying 
> the RegionServer's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2014-10-10 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: (was: HDFS-6133-3.patch)

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133.patch
>
>
> Currently, running the Balancer destroys the RegionServer's data locality.
> If getBlocks could exclude blocks belonging to files with a specific path 
> prefix, like "/hbase", then we could run the Balancer without destroying 
> the RegionServer's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2014-10-10 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-3.patch

Updated the patch and merged with trunk.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133.patch
>
>
> Currently, running the Balancer destroys the RegionServer's data locality.
> If getBlocks could exclude blocks belonging to files with a specific path 
> prefix, like "/hbase", then we could run the Balancer without destroying 
> the RegionServer's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2014-11-18 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-4.patch

Thanks, Yongjun Zhang.
Updated the patch according to the comments.

{quote}
The concept of favoredNodes pre-existed before your patch, now your patch 
defines that as long as favoredNodes is passed, then pinning will be done. So 
we are changing the prior definition of how favoredNodes are used. Why not add 
some additional interface to tell that pinning will happen so we have the 
option not to pin even if favoredNodes is passed? Not necessarily you need to 
do what I suggested here, but I'd like to understand your thoughts here.
{quote}
I think that most of the time, if you use favoredNodes, you want to keep the block on 
those machines, so to keep things simple I didn't add a new interface.
{quote}
Do we ever need interface to do unpinning?
{quote}
We can add unpinning in another issue if there is a use case that needs it.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133.patch
>
>
> Currently, running the Balancer destroys the RegionServer's data locality.
> If getBlocks could exclude blocks belonging to files with a specific path 
> prefix, like "/hbase", then we could run the Balancer without destroying 
> the RegionServer's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck

2014-11-23 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-7429:
--

 Summary: DomainSocketWatcher.doPoll0 stuck
 Key: HDFS-7429
 URL: https://issues.apache.org/jira/browse/HDFS-7429
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: zhaoyunjiong


I found that some of our DataNodes hit "exceeds the limit of concurrent 
xciever" (the limit is 4K).

After checking the stack traces, I suspect that DomainSocketWatcher.doPoll0 is stuck:
{quote}
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition 
[0x7f558d5d4000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000740df9c90> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
--
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable 
[0x7f558d3d2000]
   java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
at 
org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)

"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition 
[0x7f558d7d6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000740df9cb0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Thread.java:745)
 

"Thread-163852" daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable 
[0x7f55aef6e000]
   java.lang.Thread.State: RUNNABLE 
at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
at java.lang.Thre

[jira] [Updated] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck

2014-11-23 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7429:
---
Attachment: 11241025
11241023
11241021

Uploaded more stack trace files.

> DomainSocketWatcher.doPoll0 stuck
> -
>
> Key: HDFS-7429
> URL: https://issues.apache.org/jira/browse/HDFS-7429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
> Attachments: 11241021, 11241023, 11241025
>
>
> I found that some of our DataNodes hit "exceeds the limit of concurrent 
> xciever" (the limit is 4K).
> After checking the stack traces, I suspect that DomainSocketWatcher.doPoll0 is stuck:
> {quote}
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition 
> [0x7f558d5d4000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9c90> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
> at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
> at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> --
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable 
> [0x7f558d3d2000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
> at 
> org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
> at 
> org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition 
> [0x7f558d7d6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9cb0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.

[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck

2014-11-25 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7429:
---
Summary: DomainSocketWatcher.kick stuck  (was: DomainSocketWatcher.doPoll0 
stuck)

> DomainSocketWatcher.kick stuck
> --
>
> Key: HDFS-7429
> URL: https://issues.apache.org/jira/browse/HDFS-7429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
> Attachments: 11241021, 11241023, 11241025
>
>
> I found that some of our DataNodes hit "exceeds the limit of concurrent 
> xciever" (the limit is 4K).
> After checking the stack traces, I suspect that DomainSocketWatcher.doPoll0 is stuck:
> {quote}
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition 
> [0x7f558d5d4000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9c90> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
> at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
> at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> --
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable 
> [0x7f558d3d2000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
> at 
> org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
> at 
> org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition 
> [0x7f558d7d6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9cb0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:745)
>

[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck

2014-11-25 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7429:
---
Description: 
I found that some of our DataNodes hit "exceeds the limit of concurrent 
xciever" (the limit is 4K).

After checking the stack traces, I suspect that 
org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by 
DomainSocketWatcher.kick, is stuck:
{quote}
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition 
[0x7f558d5d4000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000740df9c90> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
--
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable 
[0x7f558d3d2000]
   java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
at 
org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)

"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition 
[0x7f558d7d6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000740df9cb0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Thread.java:745)
 

"Thread-163852" daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable 
[0x7f55aef6e000]
   java.lang.Thread.State: RUNNABLE 
at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
at java.lang.Thread.run(Thread.java:745)
{quote}

  was:
I found

[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck

2014-11-25 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224249#comment-14224249
 ] 

zhaoyunjiong commented on HDFS-7429:


The previous description is not right. 
The stuck thread is in 
org.apache.hadoop.net.unix.DomainSocket.writeArray0, as shown below.
{quote}
$ grep -B2 -A10 DomainSocket.writeArray 1124102*
11241021-"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for 
operation #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable 
[0x7f7db06c5000]
11241021-   java.lang.Thread.State: RUNNABLE
11241021:   at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native 
Method)
11241021-   at 
org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
11241021-   at 
org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
11241021-   at 
org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
11241021-   at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
11241021-   at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
11241021-   at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
11241021-   at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
11241021-   at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
11241021-   at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
11241021-   at java.lang.Thread.run(Thread.java:745)
--
--
11241023-"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for 
operation #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable 
[0x7f7db06c5000]
11241023-   java.lang.Thread.State: RUNNABLE
11241023:   at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native 
Method)
11241023-   at 
org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
11241023-   at 
org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
11241023-   at 
org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
11241023-   at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
11241023-   at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
11241023-   at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
11241023-   at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
11241023-   at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
11241023-   at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
11241023-   at java.lang.Thread.run(Thread.java:745)
--
--
11241025-"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for 
operation #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable 
[0x7f7db06c5000]
11241025-   java.lang.Thread.State: RUNNABLE
11241025:   at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native 
Method)
11241025-   at 
org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
11241025-   at 
org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
11241025-   at 
org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
11241025-   at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
11241025-   at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
11241025-   at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
11241025-   at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
11241025-   at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
11241025-   at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
11241025-   at java.lang.Thread.run(Thread.java:745)
{quote}


> DomainSocketWatcher.kick stuck
> --
>
> Key: HDFS-7429
> URL: https://issues.apache.org/jira/browse/HDFS-7429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
> Attachments: 11241021, 11241023, 11241025
>
>
> I found that some of our DataNodes hit "exceeds the limit of concurrent 
> xciever" (the limit is 4K).
> After checking the stack traces, I suspect that 
> org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by 
>

[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck

2014-11-25 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7429:
---
Description: 
I found that some of our DataNodes hit "exceeds the limit of concurrent 
xciever" (the limit is 4K).

After checking the stack traces, I suspect that 
org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by 
DomainSocketWatcher.kick, is stuck:
{quote}
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition 
[0x7f558d5d4000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000740df9c90> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
--
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable 
[0x7f7db06c5000]
   java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
at 
org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Thread.java:745)

"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
#1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition 
[0x7f558d7d6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000740df9cb0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
at java.lang.Thread.run(Thread.java:745)
 

"Thread-163852" daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable 
[0x7f55aef6e000]
   java.lang.Thread.State: RUNNABLE 
at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
at 
org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
at java.lang.Thr

[jira] [Assigned] (HDFS-7429) DomainSocketWatcher.kick stuck

2014-11-25 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong reassigned HDFS-7429:
--

Assignee: zhaoyunjiong

> DomainSocketWatcher.kick stuck
> --
>
> Key: HDFS-7429
> URL: https://issues.apache.org/jira/browse/HDFS-7429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: 11241021, 11241023, 11241025
>
>
> I found that some of our DataNodes hit "exceeds the limit of concurrent 
> xciever" (the limit is 4K).
> After checking the stack traces, I suspect that 
> org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by 
> DomainSocketWatcher.kick, is stuck:
> {quote}
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition 
> [0x7f558d5d4000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9c90> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
> at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
> at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> --
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable 
> [0x7f7db06c5000]
>java.lang.Thread.State: RUNNABLE
>   at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
>   at 
> org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
>   at 
> org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
>   at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
>   at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
>   at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>   at java.lang.Thread.run(Thread.java:745)
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition 
> [0x7f558d7d6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9cb0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.

[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck

2014-11-25 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224325#comment-14224325
 ] 

zhaoyunjiong commented on HDFS-7429:


The problem here is that on our machines we can only send 299 bytes to the domain socket.
When it tries to send the 300th byte, the write blocks while 
DomainSocketWatcher.add(DomainSocket sock, Handler handler) still holds the lock, so 
watcherThread.run can't acquire the lock and clear the buffer; it's a live lock.

I'm not sure yet which configuration controls the buffer size of 299.
For now I suspect net.core.netdev_budget, which is 300 on our machines.
I'll upload a patch later that limits the bytes sent, to prevent the live lock.

By the way, should I move this to the HADOOP COMMON project?
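
A minimal sketch of the "limit the bytes sent" idea; this is not the DomainSocketWatcher code, just an illustration of the pattern under the assumption that a single pending wakeup byte is enough:
{code}
// Illustration of a bounded "kick": write at most one pending wakeup byte,
// so a caller holding the lock can never block on a full pipe while the
// watcher thread is waiting for that same lock to drain it.
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicBoolean;

class BoundedKicker {
  private final OutputStream pipe;          // wakeup pipe to the watcher
  private final AtomicBoolean kickPending = new AtomicBoolean(false);

  BoundedKicker(OutputStream pipe) {
    this.pipe = pipe;
  }

  // Called by add()/remove() while holding the watcher lock.
  void kick() throws IOException {
    // Only the first caller since the last drain actually writes a byte;
    // everyone else relies on that byte already being in flight.
    if (kickPending.compareAndSet(false, true)) {
      pipe.write(0);
    }
  }

  // Called by the watcher thread after it drains the pipe.
  void drained() {
    kickPending.set(false);
  }
}
{code}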

> DomainSocketWatcher.kick stuck
> --
>
> Key: HDFS-7429
> URL: https://issues.apache.org/jira/browse/HDFS-7429
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
> Attachments: 11241021, 11241023, 11241025
>
>
> I found that some of our DataNodes hit "exceeds the limit of concurrent 
> xciever" (the limit is 4K).
> After checking the stack traces, I suspect that 
> org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by 
> DomainSocketWatcher.kick, is stuck:
> {quote}
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition 
> [0x7f558d5d4000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9c90> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
> at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
> at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
> at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> --
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable 
> [0x7f7db06c5000]
>java.lang.Thread.State: RUNNABLE
>   at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
>   at 
> org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
>   at 
> org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
>   at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
>   at 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
>   at 
> org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>   at java.lang.Thread.run(Thread.java:745)
> "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation 
> #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition 
> [0x7f558d7d6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000740df9cb0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at 
> org.apache.hadoop.net.unix.DomainSocketWatche

[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2014-11-25 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-5.patch

Thanks Yongjun Zhang; updated the patch to fix the formatting.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile

2014-12-02 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-7470:
--

 Summary: SecondaryNameNode need twice memory when calling 
reloadFromImageFile
 Key: HDFS-7470
 URL: https://issues.apache.org/jira/browse/HDFS-7470
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


histo information at 2014-12-02 01:19
{quote}
 num   #instances       #bytes  class name
----------------------------------------------------------
   1:   186449630  19326123016  [Ljava.lang.Object;
   2:   157366649  15107198304  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:   183409030  11738177920  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   4:   157358401   5244264024  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   5:           3   3489661000  [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
   6:    29253275   1872719664  [B
   7:     3230821    284312248  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   8:     2756284    110251360  java.util.ArrayList
   9:      469158     22519584  org.apache.hadoop.fs.permission.AclEntry
  10:         847     17133032  [Ljava.util.HashMap$Entry;
  11:      188471     17059632  [C
  12:      314614     10067656  [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
  13:      234579      9383160  com.google.common.collect.RegularImmutableList
  14:       49584      6850280  
  15:       49584      6356704  
  16:      187270      5992640  java.lang.String
  17:      234579      5629896  org.apache.hadoop.hdfs.server.namenode.AclFeature
{quote}
histo information at 2014-12-02 01:32
{quote}
 num   #instances       #bytes  class name
----------------------------------------------------------
   1:   355838051  35566651032  [Ljava.lang.Object;
   2:   302272758  29018184768  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:   352500723  22560046272  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   4:   302264510  10075087952  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   5:   177120233   9374983920  [B
   6:           3   3489661000  [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
   7:     6191688    544868544  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   8:     2799256    111970240  java.util.ArrayList
   9:      890728     42754944  org.apache.hadoop.fs.permission.AclEntry
  10:      330986     29974408  [C
  11:      596871     19099880  [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
  12:      445364     17814560  com.google.common.collect.RegularImmutableList
  13:         844     17132816  [Ljava.util.HashMap$Entry;
  14:      445364     10688736  org.apache.hadoop.hdfs.server.namenode.AclFeature
  15:      329789     10553248  java.lang.String
  16:       91741      8807136  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
  17:       49584      6850280  
{quote}

And the stack trace shows it was doing reloadFromImageFile:
{quote}
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:354)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1630)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
at 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:350)
at java.lang.Thread.run(Thread.java:745)
{quote}

So before doing reloadFromImageFile, I thi

[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile

2014-12-02 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7470:
---
Attachment: HDFS-7470.patch

This patch re-initializes the namesystem before calling reloadFromImageFile.
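The idea, roughly (the interface and method names below are placeholders, not the 
actual SecondaryNameNode/FSImage API):
{code}
import java.io.File;
import java.io.IOException;

/**
 * Illustrative sketch only: the interface and method names are placeholders,
 * not the real SecondaryNameNode/FSImage API. The point of the patch is to
 * drop the previously loaded namespace before loading the new fsimage, so the
 * checkpointer never holds two full copies of the namespace at the same time.
 */
interface NamespaceHolder {
  void clear();                               // release INode/BlockInfo references
  void loadImage(File imageFile) throws IOException;
}

class ReloadSketch {
  static void reload(NamespaceHolder ns, File newImage) throws IOException {
    ns.clear();             // the old namespace becomes garbage-collectable here
    ns.loadImage(newImage); // peak memory ~ one namespace instead of two
  }
}
{code}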

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> 
>
> Key: HDFS-7470
> URL: https://issues.apache.org/jira/browse/HDFS-7470
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7470.patch
>
>
> histo information at 2014-12-02 01:19
> {quote}
>  num #instances #bytes  class name
> --
>1: 18644963019326123016  [Ljava.lang.Object;
>2: 15736664915107198304  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 18340903011738177920  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 157358401 5244264024  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>6:  29253275 1872719664  [B
>7:   3230821  284312248  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2756284  110251360  java.util.ArrayList
>9:469158   22519584  org.apache.hadoop.fs.permission.AclEntry
>   10:   847   17133032  [Ljava.util.HashMap$Entry;
>   11:188471   17059632  [C
>   12:314614   10067656  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   13:2345799383160  
> com.google.common.collect.RegularImmutableList
>   14: 495846850280  
>   15: 495846356704  
>   16:1872705992640  java.lang.String
>   17:2345795629896  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
> {quote}
> histo information at 2014-12-02 01:32
> {quote}
>  num #instances #bytes  class name
> --
>1: 35583805135566651032  [Ljava.lang.Object;
>2: 30227275829018184768  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 35250072322560046272  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 30226451010075087952  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 177120233 9374983920  [B
>6: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>7:   6191688  544868544  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2799256  111970240  java.util.ArrayList
>9:890728   42754944  org.apache.hadoop.fs.permission.AclEntry
>   10:330986   29974408  [C
>   11:596871   19099880  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   12:445364   17814560  
> com.google.common.collect.RegularImmutableList
>   13:   844   17132816  [Ljava.util.HashMap$Entry;
>   14:445364   10688736  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
>   15:329789   10553248  java.lang.String
>   16: 917418807136  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
>   17: 495846850280  
> {quote}
> And the stack trace shows it was doing reloadFromImageFile:
> {quote}
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:354)
>   at java.security.AccessController.doPrivileged(Na

[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile

2014-12-02 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7470:
---
Status: Patch Available  (was: Open)

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> 
>
> Key: HDFS-7470
> URL: https://issues.apache.org/jira/browse/HDFS-7470
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7470.patch
>
>
> histo information at 2014-12-02 01:19
> {quote}
>  num #instances #bytes  class name
> --
>1: 18644963019326123016  [Ljava.lang.Object;
>2: 15736664915107198304  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 18340903011738177920  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 157358401 5244264024  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>6:  29253275 1872719664  [B
>7:   3230821  284312248  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2756284  110251360  java.util.ArrayList
>9:469158   22519584  org.apache.hadoop.fs.permission.AclEntry
>   10:   847   17133032  [Ljava.util.HashMap$Entry;
>   11:188471   17059632  [C
>   12:314614   10067656  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   13:2345799383160  
> com.google.common.collect.RegularImmutableList
>   14: 495846850280  
>   15: 495846356704  
>   16:1872705992640  java.lang.String
>   17:2345795629896  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
> {quote}
> histo information at 2014-12-02 01:32
> {quote}
>  num #instances #bytes  class name
> --
>1: 35583805135566651032  [Ljava.lang.Object;
>2: 30227275829018184768  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 35250072322560046272  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 30226451010075087952  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 177120233 9374983920  [B
>6: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>7:   6191688  544868544  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2799256  111970240  java.util.ArrayList
>9:890728   42754944  org.apache.hadoop.fs.permission.AclEntry
>   10:330986   29974408  [C
>   11:596871   19099880  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   12:445364   17814560  
> com.google.common.collect.RegularImmutableList
>   13:   844   17132816  [Ljava.util.HashMap$Entry;
>   14:445364   10688736  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
>   15:329789   10553248  java.lang.String
>   16: 917418807136  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
>   17: 495846850280  
> {quote}
> And the stack trace shows it was doing reloadFromImageFile:
> {quote}
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:354)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs

[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile

2014-12-04 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7470:
---
Attachment: HDFS-7470.1.patch

Update patch to fix test failure.

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> 
>
> Key: HDFS-7470
> URL: https://issues.apache.org/jira/browse/HDFS-7470
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7470.1.patch, HDFS-7470.patch
>
>
> histo information at 2014-12-02 01:19
> {quote}
>  num #instances #bytes  class name
> --
>1: 18644963019326123016  [Ljava.lang.Object;
>2: 15736664915107198304  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 18340903011738177920  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 157358401 5244264024  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>6:  29253275 1872719664  [B
>7:   3230821  284312248  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2756284  110251360  java.util.ArrayList
>9:469158   22519584  org.apache.hadoop.fs.permission.AclEntry
>   10:   847   17133032  [Ljava.util.HashMap$Entry;
>   11:188471   17059632  [C
>   12:314614   10067656  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   13:2345799383160  
> com.google.common.collect.RegularImmutableList
>   14: 495846850280  
>   15: 495846356704  
>   16:1872705992640  java.lang.String
>   17:2345795629896  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
> {quote}
> histo information at 2014-12-02 01:32
> {quote}
>  num #instances #bytes  class name
> --
>1: 35583805135566651032  [Ljava.lang.Object;
>2: 30227275829018184768  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 35250072322560046272  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 30226451010075087952  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 177120233 9374983920  [B
>6: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>7:   6191688  544868544  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2799256  111970240  java.util.ArrayList
>9:890728   42754944  org.apache.hadoop.fs.permission.AclEntry
>   10:330986   29974408  [C
>   11:596871   19099880  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   12:445364   17814560  
> com.google.common.collect.RegularImmutableList
>   13:   844   17132816  [Ljava.util.HashMap$Entry;
>   14:445364   10688736  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
>   15:329789   10553248  java.lang.String
>   16: 917418807136  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
>   17: 495846850280  
> {quote}
> And the stack trace shows it was doing reloadFromImageFile:
> {quote}
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:354)
>   at java.security.AccessController.doPrivileged(Native Meth

[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile

2014-12-11 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7470:
---
Attachment: secondaryNameNode.jstack.txt

Thanks Chris Nauroth for your time.
Uploaded a stack trace file for the SecondaryNameNode.

Correct me if I'm wrong, but from the stack trace I don't think two threads can hold 
FSNamesystem.writeLock at the same time.
Also, the SecondaryNameNode doesn't start services like the BlockManager and CacheManager, 
and it never opens the edit log for writing.

I'll check again whether I missed any risk, or try to find a safer solution later.

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> 
>
> Key: HDFS-7470
> URL: https://issues.apache.org/jira/browse/HDFS-7470
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7470.1.patch, HDFS-7470.patch, 
> secondaryNameNode.jstack.txt
>
>
> histo information at 2014-12-02 01:19
> {quote}
>  num #instances #bytes  class name
> --
>1: 18644963019326123016  [Ljava.lang.Object;
>2: 15736664915107198304  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 18340903011738177920  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 157358401 5244264024  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>6:  29253275 1872719664  [B
>7:   3230821  284312248  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2756284  110251360  java.util.ArrayList
>9:469158   22519584  org.apache.hadoop.fs.permission.AclEntry
>   10:   847   17133032  [Ljava.util.HashMap$Entry;
>   11:188471   17059632  [C
>   12:314614   10067656  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   13:2345799383160  
> com.google.common.collect.RegularImmutableList
>   14: 495846850280  
>   15: 495846356704  
>   16:1872705992640  java.lang.String
>   17:2345795629896  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
> {quote}
> histo information at 2014-12-02 01:32
> {quote}
>  num #instances #bytes  class name
> --
>1: 35583805135566651032  [Ljava.lang.Object;
>2: 30227275829018184768  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 35250072322560046272  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 30226451010075087952  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 177120233 9374983920  [B
>6: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>7:   6191688  544868544  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2799256  111970240  java.util.ArrayList
>9:890728   42754944  org.apache.hadoop.fs.permission.AclEntry
>   10:330986   29974408  [C
>   11:596871   19099880  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   12:445364   17814560  
> com.google.common.collect.RegularImmutableList
>   13:   844   17132816  [Ljava.util.HashMap$Entry;
>   14:445364   10688736  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
>   15:329789   10553248  java.lang.String
>   16: 917418807136  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
>   17: 495846850280  
> {quote}
> And the stack trace shows it was doing reloadFromImageFile:
> {quote}
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
>   at 
> org.apache.hadoop.hdfs.server.namenode.Seco

[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2014-12-21 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-6.patch

Thanks Kai Zheng for pointing out the bug in TestBalancer; it is fixed in the new patch.
Also updated the comment you mentioned.

For the mover, it uses replaceBlock just like the balancer, so the mover will not be able 
to move such blocks either.

For storage type/policy, Kihwal Lee already answered the question:
{quote}
We will make additional changes in separate jiras to make NN aware of favored 
nodes. I think the patch in this jira is fine as a stepping stone for the 
further work.
{quote}

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile

2015-01-13 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7470:
---
Attachment: HDFS-7470.2.patch

This patch clears the BlocksMap in FSNamesystem.clear().
I believe this should release the memory.

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> 
>
> Key: HDFS-7470
> URL: https://issues.apache.org/jira/browse/HDFS-7470
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7470.1.patch, HDFS-7470.2.patch, HDFS-7470.patch, 
> secondaryNameNode.jstack.txt
>
>
> histo information at 2014-12-02 01:19
> {quote}
>  num #instances #bytes  class name
> --
>1: 18644963019326123016  [Ljava.lang.Object;
>2: 15736664915107198304  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 18340903011738177920  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 157358401 5244264024  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>6:  29253275 1872719664  [B
>7:   3230821  284312248  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2756284  110251360  java.util.ArrayList
>9:469158   22519584  org.apache.hadoop.fs.permission.AclEntry
>   10:   847   17133032  [Ljava.util.HashMap$Entry;
>   11:188471   17059632  [C
>   12:314614   10067656  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   13:2345799383160  
> com.google.common.collect.RegularImmutableList
>   14: 495846850280  
>   15: 495846356704  
>   16:1872705992640  java.lang.String
>   17:2345795629896  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
> {quote}
> histo information at 2014-12-02 01:32
> {quote}
>  num #instances #bytes  class name
> --
>1: 35583805135566651032  [Ljava.lang.Object;
>2: 30227275829018184768  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 35250072322560046272  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 30226451010075087952  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 177120233 9374983920  [B
>6: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>7:   6191688  544868544  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2799256  111970240  java.util.ArrayList
>9:890728   42754944  org.apache.hadoop.fs.permission.AclEntry
>   10:330986   29974408  [C
>   11:596871   19099880  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   12:445364   17814560  
> com.google.common.collect.RegularImmutableList
>   13:   844   17132816  [Ljava.util.HashMap$Entry;
>   14:445364   10688736  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
>   15:329789   10553248  java.lang.String
>   16: 917418807136  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
>   17: 495846850280  
> {quote}
> And the stack trace shows it was doing reloadFromImageFile:
> {quote}
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388)
>   at 
> org.apache.hadoop.hdfs.server.namenode.S

[jira] [Commented] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile

2015-01-13 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276347#comment-14276347
 ] 

zhaoyunjiong commented on HDFS-7470:


Chris, thanks for your time.

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> 
>
> Key: HDFS-7470
> URL: https://issues.apache.org/jira/browse/HDFS-7470
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 2.7.0
>
> Attachments: HDFS-7470.1.patch, HDFS-7470.2.patch, HDFS-7470.patch, 
> secondaryNameNode.jstack.txt
>
>
> histo information at 2014-12-02 01:19
> {quote}
>  num #instances #bytes  class name
> --
>1: 18644963019326123016  [Ljava.lang.Object;
>2: 15736664915107198304  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 18340903011738177920  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 157358401 5244264024  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>6:  29253275 1872719664  [B
>7:   3230821  284312248  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2756284  110251360  java.util.ArrayList
>9:469158   22519584  org.apache.hadoop.fs.permission.AclEntry
>   10:   847   17133032  [Ljava.util.HashMap$Entry;
>   11:188471   17059632  [C
>   12:314614   10067656  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   13:2345799383160  
> com.google.common.collect.RegularImmutableList
>   14: 495846850280  
>   15: 495846356704  
>   16:1872705992640  java.lang.String
>   17:2345795629896  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
> {quote}
> histo information at 2014-12-02 01:32
> {quote}
>  num #instances #bytes  class name
> --
>1: 35583805135566651032  [Ljava.lang.Object;
>2: 30227275829018184768  
> org.apache.hadoop.hdfs.server.namenode.INodeFile
>3: 35250072322560046272  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
>4: 30226451010075087952  
> [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
>5: 177120233 9374983920  [B
>6: 3 3489661000  
> [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
>7:   6191688  544868544  
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory
>8:   2799256  111970240  java.util.ArrayList
>9:890728   42754944  org.apache.hadoop.fs.permission.AclEntry
>   10:330986   29974408  [C
>   11:596871   19099880  
> [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
>   12:445364   17814560  
> com.google.common.collect.RegularImmutableList
>   13:   844   17132816  [Ljava.util.HashMap$Entry;
>   14:445364   10688736  
> org.apache.hadoop.hdfs.server.namenode.AclFeature
>   15:329789   10553248  java.lang.String
>   16: 917418807136  
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
>   17: 495846850280  
> {quote}
> And the stack trace shows it was doing reloadFromImageFile:
> {quote}
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536)
>   at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388)
>   at 
> org.apache.hadoop.hdfs.s

[jira] [Created] (HDFS-7735) Optimize decommission Datanodes to reduce the impact on NameNode's performance

2015-02-04 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-7735:
--

 Summary: Optimize decommission Datanodes to reduce the impact on 
NameNode's performance
 Key: HDFS-7735
 URL: https://issues.apache.org/jira/browse/HDFS-7735
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


When decommissioning DataNodes, by default the DecommissionManager checks 
progress every 30 seconds while holding the Namesystem write lock. This 
significantly impacts the NameNode's performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7735) Optimize decommission Datanodes to reduce the impact on NameNode's performance

2015-02-04 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7735:
---
Status: Patch Available  (was: Open)

> Optimize decommission Datanodes to reduce the impact on NameNode's performance
> --
>
> Key: HDFS-7735
> URL: https://issues.apache.org/jira/browse/HDFS-7735
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7735.patch
>
>
> When decommission DataNodes, by default, DecommissionManager will check 
> progress every 30 seconds, and it will hold writeLock of Namesystem. It 
> significantly impact NameNode's performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7735) Optimize decommission Datanodes to reduce the impact on NameNode's performance

2015-02-04 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7735:
---
Attachment: HDFS-7735.patch

This patch tries to reduce the impact in two ways (a configuration sketch follows the list):
# Adjust dfs.namenode.decommission.interval from 30 to 300 to reduce the check 
frequency
# Remove blocks that already have enough live replicas from the decommissioning 
nodes' lists to reduce the check time
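As a reference for the first item, a hedged sketch of the configuration change (the key 
name and the 300-second value come from the list above; a real deployment would normally 
set this in hdfs-site.xml rather than in code):
{code}
import org.apache.hadoop.conf.Configuration;

public class DecommissionIntervalSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Check decommission progress every 300 seconds instead of the default 30,
    // so the Namesystem write lock is taken far less often.
    conf.setInt("dfs.namenode.decommission.interval", 300);
    System.out.println(conf.getInt("dfs.namenode.decommission.interval", 30));
  }
}
{code}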

> Optimize decommission Datanodes to reduce the impact on NameNode's performance
> --
>
> Key: HDFS-7735
> URL: https://issues.apache.org/jira/browse/HDFS-7735
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7735.patch
>
>
> When decommission DataNodes, by default, DecommissionManager will check 
> progress every 30 seconds, and it will hold writeLock of Namesystem. It 
> significantly impact NameNode's performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2015-02-05 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-7.patch

Thanks Tsz Wo Nicholas Sze.
Updated the patch to add targetPinnings, so that blocks are pinned only on the favored datanodes.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, 
> HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2015-02-05 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: (was: HDFS-6133-7.patch)

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2015-02-05 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-7.patch

Thanks Nicholas.
Updated the patch to fix the test failures.
By the way, I used "dev-support/test-patch.sh" to test my patch yesterday and it 
didn't catch the test failure, so it seems my local test environment has some 
problems.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, 
> HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2015-02-06 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-8.patch

This version should pass the tests.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, 
> HDFS-6133-8.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2015-02-08 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-9.patch

Thanks Nicholas, Yongjun Zhang and Benoy Antony.
Updated the patch to add a configuration key to enable/disable the pinning feature. I 
use a configuration set on the DFSClient; compared to setting it on the DataNode 
side, I think this is more flexible.
{quote}
PBHelper.convert(..) only adds one FALSE when targetPinnings == null. Should we 
add n FALSEs, where n = targetPinnings.length?
{quote}
One is enough, because each DataNode in the pipeline adds another one when it 
sends to the next DataNode.
For how to implement this feature on Windows, and whether to use the sticky bit or 
a second file, can we handle that in another jira issue later?

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
> HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, 
> HDFS-6133-8.patch, HDFS-6133-9.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2015-02-09 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-10.patch

Thanks Nicholas.
Updated the patch to use dfs.datanode.block-pinning.enabled.

{quote}
> I guess using existence of a second file is easier to be compatible across 
> different platforms (though sticky bit has the advantage of storing the info 
> at the same file).
If using a different file, we need to delete it when deleting the block. How 
about using the execute bit?
{quote}
I'm not sure whether there are block files created with the execute bit set, so I 
don't think it is safe enough to rely on the execute bit.

I prefer Yongjun Zhang's mixed mechanism:
{quote}
Another option is to use a mixed mechanism, say, if it's on linux, use sticky 
bit, otherwise, use existence of a second file. This method means lack of 
consistency between different platforms. Just wanted to throw a thought here.
{quote}
If you agree with the mixed mechanism, I'll implement it before my vacation starts 
on Feb 14th.
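A rough sketch of what the mixed mechanism could look like (the enum and method below 
are illustrative only, not a committed API; the platform check reuses Hadoop's 
Shell.WINDOWS flag):
{code}
import org.apache.hadoop.util.Shell;

enum PinningMechanism {
  STICKY_BIT,   // reuse the block file's sticky bit (POSIX file systems)
  SECOND_FILE   // drop a sidecar marker file next to the block file
}

class PinningMechanismSketch {
  static PinningMechanism choose() {
    // On Windows there is no sticky bit, so fall back to the marker file.
    return Shell.WINDOWS ? PinningMechanism.SECOND_FILE : PinningMechanism.STICKY_BIT;
  }
}
{code}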

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, 
> HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, 
> HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2015-02-09 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313405#comment-14313405
 ] 

zhaoyunjiong commented on HDFS-6133:


You mean the choice of mechanism should not depend on the OS, but only on the 
configured mechanism itself?

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, 
> HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, 
> HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode

2015-02-09 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-7759:
--

 Summary: Provide existence-of-a-second-file implementation for 
pinning blocks on Datanode
 Key: HDFS-7759
 URL: https://issues.apache.org/jira/browse/HDFS-7759
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


Provide an existence-of-a-second-file implementation for pinning blocks on the 
Datanode, and let the admin choose the mechanism (sticky bit or existence of a 
second file) used to pin blocks on favored Datanodes.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2015-02-09 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313432#comment-14313432
 ] 

zhaoyunjiong commented on HDFS-6133:


Sure.
Created HDFS-7759 to track it.
Thanks for taking the time to review the patch.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, 
> HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, 
> HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2015-02-10 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313832#comment-14313832
 ] 

zhaoyunjiong commented on HDFS-6133:


Most users will use 3 replicas, so NumFavoredNodes should be 3 in most cases. I 
don't think this will cause a big problem.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, 
> HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, 
> HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path

2015-02-10 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-11.patch

Updated the patch to resolve a merge conflict in 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, 
> HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, 
> HDFS-6133-9.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2015-02-11 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317700#comment-14317700
 ] 

zhaoyunjiong commented on HDFS-6133:


Sorry, which question? I clicked the URL, but it doesn't show the question clearly.

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, datanode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 2.7.0
>
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, 
> HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, 
> HDFS-6133-9.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2015-02-12 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319477#comment-14319477
 ] 

zhaoyunjiong commented on HDFS-6133:


{quote}
Is there use scenario that one want to specify larger number of favoredNodes?
{quote}
It might happen.
Let's say a user sets 1000 favoredNodes with 10 replicas; DFSOutputStream.getPinnings 
would do 10,000 comparisons in the worst case.
That doesn't seem too bad.
Do you think we need to optimize the code?
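For scale, a sketch of the cheaper variant (this is not the real 
DFSOutputStream.getPinnings implementation, just the idea): with f favored nodes and 
r targets, the naive nested scan costs O(f * r) comparisons, while a HashSet of 
favored addresses brings it down to roughly O(f + r).
{code}
import java.net.InetSocketAddress;
import java.util.HashSet;
import java.util.Set;

public class PinningCostSketch {
  static boolean[] getPinnings(InetSocketAddress[] favored, InetSocketAddress[] targets) {
    Set<InetSocketAddress> favoredSet = new HashSet<>();
    for (InetSocketAddress f : favored) {
      favoredSet.add(f);                           // O(f) to build the set once
    }
    boolean[] pinned = new boolean[targets.length];
    for (int i = 0; i < targets.length; i++) {
      pinned[i] = favoredSet.contains(targets[i]); // O(1) lookup per pipeline target
    }
    return pinned;
  }
}
{code}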

> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, datanode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 2.7.0
>
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, 
> HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, 
> HDFS-6133-9.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode

2015-02-12 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7759:
---
Attachment: HDFS-7759.patch

I'll add more unit tests and check the logic for upgrade and finalizeUpgrade in two 
weeks.

> Provide existence-of-a-second-file implementation for pinning blocks on 
> Datanode
> 
>
> Key: HDFS-7759
> URL: https://issues.apache.org/jira/browse/HDFS-7759
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7759.patch
>
>
> Provide existence-of-a-second-file implementation for pinning blocks on 
> Datanode  and let admin choosing the mechanism(use sticky bit or 
> existence-of-a-second-file) to pinning blocks on favored Datanode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2015-02-27 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339994#comment-14339994
 ] 

zhaoyunjiong commented on HDFS-6133:


At first, my idea was very similar to HDFS-4420. The main difference is that I changed getBlocks to exclude blocks belonging to certain paths.
https://issues.apache.org/jira/browse/HDFS-6133?focusedCommentId=13943050&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13943050

After Daryn Sharp's comments, it changed to pinning:
https://issues.apache.org/jira/browse/HDFS-6133?focusedCommentId=13980504&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13980504


> Make Balancer support exclude specified path
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, datanode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 2.7.0
>
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, 
> HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, 
> HDFS-6133-9.patch, HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode

2015-02-27 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-7759:
---
Status: Patch Available  (was: Open)

I just checked the upgrade logic; it takes care of all files starting with "blk_".
Nicholas, do you have any comments on this patch?
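
For reference, a hypothetical sketch of an existence-of-a-second-file check of the kind this issue describes (the marker file name and directory layout here are assumptions, not what the attached patch actually uses):
{code}
import java.io.File;

// A block replica counts as pinned if a companion marker file sits next to
// the block file on the DataNode. The ".pin" suffix is purely illustrative.
public class PinMarkerSketch {
  static boolean isPinned(File blockFile) {
    File marker = new File(blockFile.getParentFile(), blockFile.getName() + ".pin");
    return marker.exists();
  }
}
{code}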

> Provide existence-of-a-second-file implementation for pinning blocks on 
> Datanode
> 
>
> Key: HDFS-7759
> URL: https://issues.apache.org/jira/browse/HDFS-7759
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7759.patch
>
>
> Provide existence-of-a-second-file implementation for pinning blocks on 
> Datanode  and let admin choosing the mechanism(use sticky bit or 
> existence-of-a-second-file) to pinning blocks on favored Datanode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode

2015-03-01 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342638#comment-14342638
 ] 

zhaoyunjiong commented on HDFS-7759:


The test failures are not related.

> Provide existence-of-a-second-file implementation for pinning blocks on 
> Datanode
> 
>
> Key: HDFS-7759
> URL: https://issues.apache.org/jira/browse/HDFS-7759
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-7759.patch
>
>
> Provide existence-of-a-second-file implementation for pinning blocks on 
> Datanode  and let admin choosing the mechanism(use sticky bit or 
> existence-of-a-second-file) to pinning blocks on favored Datanode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists

2014-02-10 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong reopened HDFS-5396:



I made a mistake when I resolved this as Not A Problem.
The loop
for (Iterator it = dirIterator(NameNodeDirType.IMAGE); it.hasNext();)
  sd = it.next();
returns the last image StorageDirectory, but due to HDFS-5367 that directory may not have an fsimage in it.
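
A minimal sketch of the check being proposed, using plain java.io.File paths instead of the FSImage/StorageDirectory types (the "current/fsimage" location follows the branch-1 layout; this is not the attached patch):
{code}
import java.io.File;
import java.util.List;

// Only return an fsimage that actually exists on disk, instead of blindly
// taking the file from the last IMAGE storage directory.
public class FsImageNameSketch {
  static File getFsImageName(List<File> imageDirs) {
    File latest = null;
    for (File dir : imageDirs) {
      File img = new File(dir, "current/fsimage");
      if (img.exists()) {   // skip directories where HDFS-5367 left no fsimage
        latest = img;
      }
    }
    return latest;
  }
}
{code}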

> FSImage.getFsImageName should check whether fsimage exists
> --
>
> Key: HDFS-5396
> URL: https://issues.apache.org/jira/browse/HDFS-5396
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 1.3.0
>
> Attachments: HDFS-5396-branch-1.2.patch
>
>
> In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to 
> all IMAGE dir, so we need to check whether fsimage exists before 
> FSImage.getFsImageName returned.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-12 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-5944:
--

 Summary: LeaseManager:findLeaseWithPrefixPath didn't handle path 
like /a/b/ right cause SecondaryNameNode failed do checkpoint
 Key: HDFS-5944
 URL: https://issues.apache.org/jira/browse/HDFS-5944
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.2.0, 1.2.0
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


In our cluster, we encountered an error like this:
java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)

What happened:
Client A opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
Client A kept refreshing its lease.
Client B deleted /XXX/20140206/04_30/
Client C opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log.
Then the SecondaryNameNode tried to do a checkpoint and failed, because the lease held by Client A was not deleted when Client B deleted /XXX/20140206/04_30/.

The reason is a bug in findLeaseWithPrefixPath:
  int srclen = prefix.length();
  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
    entries.put(entry.getKey(), entry.getValue());
  }
Here, when prefix is /XXX/20140206/04_30/ and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srclen) is '_', so the lease is never matched.
The fix is simple; I'll upload a patch later.
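
A tiny standalone demonstration of the off-by-one (illustrative only, not LeaseManager code):
{code}
public class TrailingSlashDemo {
  public static void main(String[] args) {
    String prefix = "/XXX/20140206/04_30/";
    String p = "/XXX/20140206/04_30/_SUCCESS.slc.log";
    int srclen = prefix.length();
    // With a trailing-slash prefix, the original check inspects the character
    // right after the slash, which is '_' here, so the lease never matches.
    System.out.println(p.charAt(srclen));       // '_'
    System.out.println(p.charAt(srclen - 1));   // '/', the separator the check expects
  }
}
{code}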



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-12 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5944:
---

Description: 
In our cluster, we encountered error like this:
java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log 
but is not under construction.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)

What happened:
Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
And Client A continue refresh it's lease.
Client B deleted /XXX/20140206/04_30/
Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
Then secondaryNameNode try to do checkpoint and failed due to failed to delete 
lease hold by Client A when Client B deleted /XXX/20140206/04_30/.

The reason is a bug in findLeaseWithPrefixPath:
 int srclen = prefix.length();
 if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
entries.put(entry.getKey(), entry.getValue());
  }
Here when prefix is /XXX/20140206/04_30/, and p is 
/XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
The fix is simple, I'll upload patch later.

  was:
In our cluster, we encountered error like this:
java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log 
but is not under construction.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
at 
org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)

What happened:
Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
And Client A continue refresh it's lease.
Client B deleted /XXX/20140206/04_30/
Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
Then secondaryNameNode try to do checkpoint and failed due to failed to delete 
lease hold by Client A when Client B deleted /XXX/20140206/04_30/.

The reason is this a bug in findLeaseWithPrefixPath:
 int srclen = prefix.length();
 if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
entries.put(entry.getKey(), entry.getValue());
  }
Here when prefix is /XXX/20140206/04_30/, and p is 
/XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
The fix is simple, I'll upload patch later.


> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA

[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-12 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5944:
---

Attachment: HDFS-5944.patch
HDFS-5944-branch-1.2.patch

This patch is very simple: if the prefix ends with '/', just subtract 1 from srclen so that p.charAt(srclen) handles the path correctly.
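
A sketch of that check with the adjustment applied (illustrative, not the exact LeaseManager code; Path.SEPARATOR_CHAR is the '/' constant from org.apache.hadoop.fs.Path):
{code}
import org.apache.hadoop.fs.Path;

public class PrefixCheckSketch {
  static boolean hasPathPrefix(String p, String prefix) {
    if (!p.startsWith(prefix)) {
      return false;
    }
    int srclen = prefix.length();
    // If the prefix already ends with '/', back up one character so that
    // p.charAt(srclen) lands on the separator for paths like "/a/b/".
    if (srclen > 1 && prefix.charAt(srclen - 1) == Path.SEPARATOR_CHAR) {
      srclen--;
    }
    return p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR;
  }
}
{code}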

> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch
>
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-13 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901171#comment-13901171
 ] 

zhaoyunjiong commented on HDFS-5944:


Brandon, thanks for taking the time to review this patch.
I don't think users call DFSClient directly.
Even through DistributedFileSystem, we can still send a path ending with "/" by passing a path like "/a/b/../",
because in getPathName, String result = makeAbsolute(file).toUri().getPath() will return "/a/".

About the unit test, I'd be happy to add one. I have two questions that need your help:
1. Is it enough to just write a unit test for findLeaseWithPrefixPath?
2. In trunk there is no TestLeaseManager.java; should I add one?

> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, 
> HDFS-5944.test.txt
>
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-17 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5944:
---

Attachment: HDFS-5944-branch-1.2.patch
HDFS-5944.patch

Updated the patches with a unit test.

> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944-branch-1.2.patch, 
> HDFS-5944.patch, HDFS-5944.patch, HDFS-5944.test.txt
>
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-17 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5944:
---

Attachment: (was: HDFS-5944-branch-1.2.patch)

> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, 
> HDFS-5944.test.txt
>
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-17 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5944:
---

Attachment: (was: HDFS-5944.patch)

> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, 
> HDFS-5944.test.txt
>
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-19 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905361#comment-13905361
 ] 

zhaoyunjiong commented on HDFS-5944:


Multiple trailing "/" characters are impossible.

> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, 
> HDFS-5944.test.txt
>
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint

2014-02-19 Thread zhaoyunjiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906435#comment-13906435
 ] 

zhaoyunjiong commented on HDFS-5944:


Thank you, Brandon and Benoy.

> LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right 
> cause SecondaryNameNode failed do checkpoint
> -
>
> Key: HDFS-5944
> URL: https://issues.apache.org/jira/browse/HDFS-5944
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.0, 2.2.0
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, 
> HDFS-5944.test.txt, HDFS-5944.trunk.patch
>
>
> In our cluster, we encountered error like this:
> java.io.IOException: saveLeases found path 
> /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949)
> What happened:
> Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write.
> And Client A continue refresh it's lease.
> Client B deleted /XXX/20140206/04_30/
> Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write
> Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log
> Then secondaryNameNode try to do checkpoint and failed due to failed to 
> delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/.
> The reason is a bug in findLeaseWithPrefixPath:
>  int srclen = prefix.length();
>  if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) {
> entries.put(entry.getKey(), entry.getValue());
>   }
> Here when prefix is /XXX/20140206/04_30/, and p is 
> /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'.
> The fix is simple, I'll upload patch later.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists

2014-02-20 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5396:
---

Attachment: HDFS-5396-branch-1.2.patch

Update patch.

> FSImage.getFsImageName should check whether fsimage exists
> --
>
> Key: HDFS-5396
> URL: https://issues.apache.org/jira/browse/HDFS-5396
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 1.3.0
>
> Attachments: HDFS-5396-branch-1.2.patch, HDFS-5396-branch-1.2.patch
>
>
> In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to 
> all IMAGE dir, so we need to check whether fsimage exists before 
> FSImage.getFsImageName returned.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists

2014-03-18 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5396:
---

Status: Patch Available  (was: Reopened)

> FSImage.getFsImageName should check whether fsimage exists
> --
>
> Key: HDFS-5396
> URL: https://issues.apache.org/jira/browse/HDFS-5396
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.1
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 1.3.0
>
> Attachments: HDFS-5396-branch-1.2.patch, HDFS-5396-branch-1.2.patch
>
>
> In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to 
> all IMAGE dir, so we need to check whether fsimage exists before 
> FSImage.getFsImageName returned.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase

2014-03-20 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-6133:
--

 Summary: Make Balancer support don't move blocks belongs to Hbase
 Key: HDFS-6133
 URL: https://issues.apache.org/jira/browse/HDFS-6133
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer, namenode
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


Currently, running the Balancer will destroy the Regionserver's data locality.
If getBlocks could exclude blocks belonging to files that have a specific path prefix, like "/hbase", then we could run the Balancer without destroying the Regionserver's data locality.
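
An illustrative sketch of the exclusion idea described above (not the committed patch, which later moved toward pinning blocks on favored DataNodes):
{code}
import java.util.ArrayList;
import java.util.List;

public class ExcludePathSketch {
  // True if the file lives under one of the excluded subtrees, e.g. "/hbase".
  static boolean isExcluded(String filePath, List<String> excludedPrefixes) {
    for (String prefix : excludedPrefixes) {
      String dirPrefix = prefix.endsWith("/") ? prefix : prefix + "/";
      if (filePath.equals(prefix) || filePath.startsWith(dirPrefix)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> excluded = new ArrayList<String>();
    excluded.add("/hbase");
    System.out.println(isExcluded("/hbase/data/t1/cf/file0", excluded)); // true
    System.out.println(isExcluded("/user/foo/part-00000", excluded));    // false
  }
}
{code}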




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase

2014-03-20 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-6133:
---

Attachment: HDFS-6133.patch

This patch makes the Balancer avoid moving blocks that belong to HBase.

> Make Balancer support don't move blocks belongs to Hbase
> 
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer, namenode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Attachments: HDFS-6133.patch
>
>
> Currently, run Balancer will destroying Regionserver's data locality.
> If getBlocks could exclude blocks belongs to files which have specific path 
> prefix, like "/hbase", then we can run Balancer without destroying 
> Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

