[jira] [Created] (HDFS-9054) Race condition on yieldCount in FSDirectory.java
zhaoyunjiong created HDFS-9054:
----------------------------------

             Summary: Race condition on yieldCount in FSDirectory.java
                 Key: HDFS-9054
                 URL: https://issues.apache.org/jira/browse/HDFS-9054
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: zhaoyunjiong
            Assignee: zhaoyunjiong
            Priority: Minor

getContentSummaryInt holds only the read lock, yet it calls fsd.addYieldCount, which can cause a race condition:

{code}
private static ContentSummary getContentSummaryInt(FSDirectory fsd,
    INodesInPath iip) throws IOException {
  fsd.readLock();
  try {
    INode targetNode = iip.getLastINode();
    if (targetNode == null) {
      throw new FileNotFoundException("File does not exist: " + iip.getPath());
    } else {
      // Make it relinquish locks every time contentCountLimit entries are
      // processed. 0 means disabled, i.e. blocking for the entire duration.
      ContentSummaryComputationContext cscc =
          new ContentSummaryComputationContext(fsd, fsd.getFSNamesystem(),
              fsd.getContentCountLimit(), fsd.getContentSleepMicroSec());
      ContentSummary cs = targetNode.computeAndConvertContentSummary(cscc);
      fsd.addYieldCount(cscc.getYieldCount());
      return cs;
    }
  } finally {
    fsd.readUnlock();
  }
}
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-9054) Race condition on yieldCount in FSDirectory.java
[ https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-9054:
-------------------------------

    Attachment: HDFS-9054.patch

This patch uses an AtomicLong to prevent the race condition.
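As a rough illustration of why an AtomicLong resolves the race (a minimal sketch, not the actual FSDirectory code; YieldCounter and its methods are hypothetical names), concurrent increments stay correct even though callers hold only a shared read lock:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class YieldCounter {
    // Hypothetical stand-in for FSDirectory's yieldCount field.
    private final AtomicLong yieldCount = new AtomicLong(0);

    // Shaped like FSDirectory.addYieldCount: safe without the write lock,
    // because AtomicLong.addAndGet performs an atomic read-modify-write.
    void addYieldCount(long value) {
        yieldCount.addAndGet(value);
    }

    long getYieldCount() {
        return yieldCount.get();
    }

    public static void main(String[] args) throws InterruptedException {
        YieldCounter c = new YieldCounter();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            Thread t = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    c.addYieldCount(1);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // With a plain long and +=, lost updates would make this total fall
        // short; with AtomicLong every increment is counted.
        System.out.println(c.getYieldCount()); // 800000
    }
}
```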
[jira] [Updated] (HDFS-9054) Race condition on yieldCount in FSDirectory.java
[ https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-9054:
-------------------------------

    Affects Version/s: 2.7.0
               Status: Patch Available  (was: Open)
[jira] [Commented] (HDFS-9054) Race condition on yieldCount in FSDirectory.java
[ https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742821#comment-14742821 ]

zhaoyunjiong commented on HDFS-9054:
------------------------------------

Agree, thanks for your time.
[jira] [Updated] (HDFS-9054) Race condition on yieldCount in FSDirectory.java
[ https://issues.apache.org/jira/browse/HDFS-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-9054:
-------------------------------

    Resolution: Won't Fix
        Status: Resolved  (was: Patch Available)
[jira] [Created] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
zhaoyunjiong created HDFS-5367:
----------------------------------

             Summary: Restore fsimage locked NameNode too long when the size of fsimage are big
                 Key: HDFS-5367
                 URL: https://issues.apache.org/jira/browse/HDFS-5367
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: zhaoyunjiong
            Assignee: zhaoyunjiong

Our cluster has a 40 GB fsimage, and we write one copy of the edit log to NFS. After the NFS mount failed temporarily, the NameNode tried to recover it during the next checkpoint, saving the 40 GB fsimage to NFS. This takes considerable time (> 40 GB / 128 MB/s = 320 seconds), during which the FSNamesystem lock is held, and this brought down our cluster.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
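The 320-second figure above is a simple lower bound, image size divided by sustained write throughput; a sketch of the arithmetic (CheckpointEstimate is a hypothetical name, and a real checkpoint adds network and metadata overhead on top of this):

```java
public class CheckpointEstimate {
    // Lower bound on the time to write an fsimage to a storage directory:
    // total bytes divided by sustained write throughput.
    static long secondsToWrite(long imageBytes, long bytesPerSecond) {
        return imageBytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long imageBytes = 40L * 1024 * 1024 * 1024; // 40 GB fsimage
        long throughput = 128L * 1024 * 1024;       // 128 MB/s to NFS
        System.out.println(secondsToWrite(imageBytes, throughput)); // 320
    }
}
```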
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5367:
-------------------------------

    Attachment:     (was: HDFS-5367)
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5367:
-------------------------------

    Attachment: HDFS-5367

The fsimage restored when SecondaryNameNode calls rollEditLog will soon be replaced when SecondaryNameNode calls rollFsImage, so I think restoring the fsimage is not necessary.
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5367:
-------------------------------

    Attachment: HDFS-5367-branch-1.2.patch

This patch avoids restoring the fsimage so that rollEditLog finishes as soon as possible.
[jira] [Commented] (HDFS-5367) Restoring namenode storage locks namenode due to unnecessary fsimage write
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798610#comment-13798610 ]

zhaoyunjiong commented on HDFS-5367:
------------------------------------

Thank you for your review.
[jira] [Created] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
zhaoyunjiong created HDFS-5396:
----------------------------------

             Summary: FSImage.getFsImageName should check whether fsimage exists
                 Key: HDFS-5396
                 URL: https://issues.apache.org/jira/browse/HDFS-5396
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 1.2.1
            Reporter: zhaoyunjiong
            Assignee: zhaoyunjiong
             Fix For: 1.3.0

After https://issues.apache.org/jira/browse/HDFS-5367, the fsimage may not be written to every image directory, so FSImage.getFsImageName needs to check whether the fsimage exists before returning.
[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5396:
-------------------------------

    Attachment: HDFS-5396-branch-1.2.patch

Check whether the fsimage exists before returning.
[jira] [Resolved] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong resolved HDFS-5396.
--------------------------------

    Resolution: Not A Problem
[jira] [Commented] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802545#comment-13802545 ]

zhaoyunjiong commented on HDFS-5396:
------------------------------------

The first image storage directory always has an fsimage file in it; restored storage directories are always appended to the end of the list. So the first one must contain an fsimage.
[jira] [Created] (HDFS-5579) Under construction files make DataNode decommission take very long hours
zhaoyunjiong created HDFS-5579:
----------------------------------

             Summary: Under construction files make DataNode decommission take very long hours
                 Key: HDFS-5579
                 URL: https://issues.apache.org/jira/browse/HDFS-5579
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.2.0, 1.2.0
            Reporter: zhaoyunjiong
            Assignee: zhaoyunjiong

We noticed that decommissioning DataNodes sometimes takes a very long time, even exceeding 100 hours. After checking the code, I found that BlockManager.computeReplicationWorkForBlocks(List<List<Block>> blocksToReplicate) won't replicate blocks that belong to under-construction files; however, in BlockManager.isReplicationInProgress(DatanodeDescriptor srcNode), if any block still needs replication, whether or not it belongs to an under-construction file, the decommission process keeps running. That's why decommissioning sometimes takes a very long time.
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment: HDFS-5579.patch
                HDFS-5579-branch-1.2.patch

This patch lets the NameNode replicate blocks that belong to under-construction files, except the last block. And if the only blocks a decommissioning DataNode still holds are last blocks of under-construction files, each with more than one live replica left elsewhere, the NameNode can set the node to DECOMMISSIONED.
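A minimal sketch of that decision rule, using hypothetical names (the real logic lives in BlockManager.isReplicationInProgress and is considerably more involved):

```java
public class DecommissionCheck {
    // Hypothetical condensation of the rule this patch introduces: a block
    // stops pinning a decommissioning node only when it is the last block of
    // an under-construction file and enough live replicas remain elsewhere.
    static boolean blockPinsNode(boolean underConstruction,
                                 boolean isLastBlock,
                                 int liveReplicas,
                                 int minReplication) {
        if (underConstruction && isLastBlock && liveReplicas > minReplication) {
            return false; // safe: the open file's last block has spare replicas
        }
        return true; // block must be replicated before decommission finishes
    }

    public static void main(String[] args) {
        // Last block of an open file with 2 live replicas (min 1): no pin.
        System.out.println(blockPinsNode(true, true, 2, 1));  // false
        // A non-last block of an open file still pins the node; with this
        // patch it is now also scheduled for replication, so it can drain.
        System.out.println(blockPinsNode(true, false, 2, 1)); // true
    }
}
```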
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment: HDFS-5579-branch-1.2.patch
                HDFS-5579.patch

Thanks, Vinay. Updated the patch per your comments, except that getLastBlock does throw IOException; I deleted it in this patch.
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment:     (was: HDFS-5579-branch-1.2.patch)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment:     (was: HDFS-5579.patch)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment: HDFS-5579.patch
                HDFS-5579-branch-1.2.patch

Updated the patch; added a test case for trunk.
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment:     (was: HDFS-5579.patch)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment:     (was: HDFS-5579-branch-1.2.patch)
[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865202#comment-13865202 ]

zhaoyunjiong commented on HDFS-5579:
------------------------------------

It's already in the patch:

{code}
+    if (bc.isUnderConstruction()) {
+      if (block.equals(bc.getLastBlock()) && curReplicas > minReplication) {
+        continue;
+      }
+      underReplicatedInOpenFiles++;
+    }
{code}
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment: HDFS-5579-branch-1.2.patch
                HDFS-5579.patch

Good point, thanks Jing. Updated the patches to fix this problem.
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment:     (was: HDFS-5579-branch-1.2.patch)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment:     (was: HDFS-5579.patch)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhaoyunjiong updated HDFS-5579:
-------------------------------

    Attachment: HDFS-5579.patch
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870385#comment-13870385 ] zhaoyunjiong commented on HDFS-5579: Thanks for your time to review the patch, Jing. > Fix For: 2.4.0 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
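The inconsistency described above can be sketched in a few lines. This is an illustrative sketch, not the actual HDFS-5579 patch: the Block class, isScheduledForReplication, and the simplified replica counting are all stand-ins. The point is that the decommission progress check must skip exactly the blocks the replication scheduler skips, otherwise a single under-construction block keeps a node in DECOMMISSIONING indefinitely.

```java
import java.util.List;

public class DecommissionCheckSketch {

    public static class Block {
        public final boolean underConstruction;
        public final int liveReplicas;
        public Block(boolean uc, int live) {
            this.underConstruction = uc;
            this.liveReplicas = live;
        }
    }

    // Mirrors computeReplicationWorkForBlocks: under-construction blocks are
    // never scheduled for replication.
    public static boolean isScheduledForReplication(Block b, int replication) {
        return !b.underConstruction && b.liveReplicas < replication;
    }

    // Fixed isReplicationInProgress: only count blocks the scheduler would
    // actually replicate, so decommission can complete.
    public static boolean isReplicationInProgress(List<Block> blocksOnNode,
                                                  int replication) {
        for (Block b : blocksOnNode) {
            if (isScheduledForReplication(b, replication)) {
                return true;
            }
        }
        return false;
    }
}
```

With this alignment, a node whose only under-replicated blocks belong to under-construction files no longer stalls the decommission loop.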
[jira] [Updated] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-2139: --- Attachment: HDFS-2139.patch Thanks Guo Ruijing & Daryn Sharp for your time. Updated the patch according to the comments: 1. add clone in DistributedFileSystem 2. add block token checks 3. support cloning part of a file; the last block still uses a hardlink, then truncateBlock adjusts the block size and meta file. Yes, the DN enforces no linking of UC blocks. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature > Reporter: Pritam Damania > Attachments: HDFS-2139.patch, HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows: > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for the destination file. > 4) For each location of the block, instruct the datanode to make a local copy of that block. > 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top-of-the-rack data transfers. > Note: An extra improvement would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode. -- This message was sent by Atlassian JIRA (v6.2#6252)
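The six steps above reduce to a per-block plan. The sketch below is hypothetical (BlockLoc and planBlockCopies are not HDFS APIs) and only illustrates the hardlink optimization noted at the end: when a destination replica is placed on a node that already holds a source replica, a hardlink replaces a network copy.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FastCopySketch {

    // Hypothetical per-block location record (stand-in for LocatedBlock).
    public static class BlockLoc {
        public final String blockId;
        public final List<String> datanodes;
        public BlockLoc(String blockId, List<String> datanodes) {
            this.blockId = blockId;
            this.datanodes = datanodes;
        }
    }

    /**
     * Returns, per block, the copy command each DataNode should run:
     * "hardlink" when the destination replica lives on a node that already
     * holds a source replica, "local-copy" otherwise.
     */
    public static Map<String, String> planBlockCopies(
            List<BlockLoc> srcBlocks, Map<String, String> dstReplicaNode) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (BlockLoc b : srcBlocks) {
            String dstNode = dstReplicaNode.get(b.blockId);
            plan.put(b.blockId,
                     b.datanodes.contains(dstNode) ? "hardlink" : "local-copy");
        }
        return plan;
    }
}
```

The patch comment above adds one constraint this sketch glosses over: the DataNode refuses to hardlink under-construction blocks, which is why the last block of a partial clone goes through truncateBlock instead.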
[jira] [Created] (HDFS-6616) bestNode shouldn't always return the first DataNode
zhaoyunjiong created HDFS-6616: -- Summary: bestNode shouldn't always return the first DataNode Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at 
org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be accessed from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java:bestNode, the first DataNode is always returned, so even after the distcp retries the job still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.patch One possible solution is to choose a DataNode randomly, at the cost of ignoring network distance. -- This message was sent by Atlassian JIRA (v6.2#6252)
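The random-choice idea in this first patch can be sketched minimally. Names here are illustrative; the real method is NamenodeWebHdfsMethods.bestNode, which also has to skip decommissioning nodes. The trade-off is exactly the one conceded above: retry diversity instead of network-distance ordering.

```java
import java.util.Random;

public class BestNodeSketch {

    // Pick a random located DataNode instead of always the first, so a
    // retry has a chance of landing on a node the client can reach.
    public static String bestNode(String[] located, Random rand) {
        if (located == null || located.length == 0) {
            throw new IllegalStateException("no datanodes located");
        }
        return located[rand.nextInt(located.length)];
    }
}
```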
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: (was: HDFS-6616.patch) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049669#comment-14049669 ] zhaoyunjiong commented on HDFS-6616: What happened on our cluster is a very rare case: the server runs HDP 2.1 and the client runs HDP 1.3, which is why I came up with this patch. Correct me if I'm wrong: when using WebHDFS, I think it will be very rare for the client and the data to be on the same host. But I agree with you that supporting exclude nodes in WebHDFS is a better idea. > Components: webhdfs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-2.patch Thanks Daryn Sharp for your time. Updated the patch to use boolean instead of Boolean. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer, namenode > Reporter: zhaoyunjiong > Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133.patch > > > Currently, running the Balancer destroys RegionServers' data locality. > If getBlocks could exclude blocks belonging to files with a specific path prefix, like "/hbase", then we could run the Balancer without destroying RegionServers' data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
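The getBlocks filter this issue proposes amounts to a path-prefix test. The sketch below is illustrative, not code from the HDFS-6133 patch: a block is skipped when the file it belongs to starts with any excluded prefix, so blocks under "/hbase" are never handed to the Balancer for movement.

```java
import java.util.List;

public class BalancerExcludeSketch {

    // True when the file owning a block falls under an excluded prefix,
    // e.g. "/hbase", and should therefore be skipped by the Balancer.
    public static boolean excluded(String filePath, List<String> excludedPrefixes) {
        for (String prefix : excludedPrefixes) {
            if (filePath.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```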
[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051076#comment-14051076 ] zhaoyunjiong commented on HDFS-6616: Yes, you are right. I never thought a user might use WebHDFS as both the source and target filesystem and run the distcp job on the source cluster. For our use case, we always run jobs on the target cluster and use WebHDFS as the source filesystem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.1.patch Updated the patch to support excluding nodes in WebHDFS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.2.patch Thanks Tsz Wo Nicholas Sze & Jing Zhao. Updated the patch according to the comments: changed ExcludeDatanodesParam.NAME to "excludedatanodes" and changed WebHdfsFileSystem to use the exclude-datanode feature. The test failures are not related. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.3.patch Updated the patch according to the comments and fixed the test failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
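The direction the thread converged on, an excludedatanodes parameter, can be sketched as follows. This is a hedged illustration, not code from the final patch: the client retries with the list of nodes it failed to reach, and node selection skips them.

```java
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

public class ExcludeNodesSketch {

    // Return the first located DataNode the client has not reported as
    // unreachable; the real selection also weighs network distance.
    public static String chooseNode(List<String> located, Set<String> excluded) {
        for (String dn : located) {
            if (!excluded.contains(dn)) {
                return dn;
            }
        }
        throw new NoSuchElementException("all located datanodes are excluded");
    }
}
```

Compared with pure random choice, this keeps the locality ordering intact and only deviates when a node is known to be bad, which is why it was preferred in the review.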
[jira] [Created] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster
zhaoyunjiong created HDFS-6829: -- Summary: DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster Key: HDFS-6829 URL: https://issues.apache.org/jira/browse/HDFS-6829 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 2.4.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor When we ran the command "hadoop dfsadmin -refreshSuperUserGroupsConfiguration", it failed and reported the message below: 14/08/05 21:32:06 WARN security.MultiRealmUserAuthentication: The serverPrincipal = doesn't confirm to the standards refreshSuperUserGroupsConfiguration: null After checking the code, I found the bug is triggered by the following: 1. We didn't set CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, which is needed by RefreshUserMappingsProtocol. In DFSAdmin, if CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY is not set, it falls back to DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY: conf.set(CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, conf.get(DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, "")); 2. But we set DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY in hdfs-site.xml. 3. DFSAdmin didn't load hdfs-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster
[ https://issues.apache.org/jira/browse/HDFS-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6829: --- Attachment: HDFS-6829.patch This patch is very simple: it uses HdfsConfiguration, which loads hdfs-site.xml, when constructing DFSAdmin. -- This message was sent by Atlassian JIRA (v6.2#6252)
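The failure chain is just configuration lookup order, which a plain Map can stand in for. The sketch below is illustrative: the Map replaces org.apache.hadoop.conf.Configuration, and the two key strings are, to the best of my knowledge, the values behind the constants named above. Without hdfs-site.xml loaded, the fallback resolves to the empty string, producing the "serverPrincipal = " warning.

```java
import java.util.Map;

public class DfsAdminConfSketch {
    // Assumed literal values of the two constants cited in the report.
    static final String SERVICE_USER_KEY = "hadoop.security.service.user.name.key";
    static final String NN_PRINCIPAL_KEY = "dfs.namenode.kerberos.principal";

    // Returns the principal DFSAdmin ends up using: the explicit service
    // user if set, otherwise the NameNode principal -- which is only
    // present when hdfs-site.xml was actually loaded into the conf.
    public static String resolvePrincipal(Map<String, String> conf) {
        String explicit = conf.get(SERVICE_USER_KEY);
        if (explicit != null && !explicit.isEmpty()) {
            return explicit;
        }
        return conf.getOrDefault(NN_PRINCIPAL_KEY, "");
    }
}
```

Constructing DFSAdmin with an HdfsConfiguration fixes this because HdfsConfiguration registers hdfs-site.xml as a default resource, so the fallback lookup succeeds.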
[jira] [Created] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
zhaoyunjiong created HDFS-7044: -- Summary: Support retention policy based on access time and modify time, use XAttr to store policy Key: HDFS-7044 URL: https://issues.apache.org/jira/browse/HDFS-7044 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong The basic idea is to set a retention policy on a directory based on access time and modify time, and to use an XAttr to store the policy. Files under a directory which has a retention policy will be deleted if they meet the retention rule. There are three rules: # access time #* If (accessTime + retentionTimeForAccess < now), the file will be deleted # modify time #* If (modifyTime + retentionTimeForModify < now), the file will be deleted # access time and modify time #* If (accessTime + retentionTimeForAccess < now && modifyTime + retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
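The three rules above reduce to a single predicate over two timestamps. A minimal sketch, assuming epoch-millisecond times; the Rule enum, parameter names, and shouldDelete are hypothetical illustrations, not part of any HDFS API:

```java
// Sketch of the proposed retention check. One rule is stored per directory
// (in the proposal, as an XAttr); a file is eligible for deletion when the
// rule's timestamp condition holds.
class RetentionCheck {
    enum Rule { ACCESS, MODIFY, ACCESS_AND_MODIFY }

    static boolean shouldDelete(Rule rule, long accessTime, long modifyTime,
                                long retentionTimeForAccess,
                                long retentionTimeForModify, long now) {
        boolean accessExpired = accessTime + retentionTimeForAccess < now;
        boolean modifyExpired = modifyTime + retentionTimeForModify < now;
        switch (rule) {
            case ACCESS:            return accessExpired;
            case MODIFY:            return modifyExpired;
            case ACCESS_AND_MODIFY: return accessExpired && modifyExpired;
            default:                return false;
        }
    }
}
```

For example, with a 10 ms access retention, a file last read at t=0 becomes deletable under the ACCESS rule once now > 10.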
[jira] [Updated] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
[ https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7044: --- Attachment: Retention policy design.pdf Attached a simple design document. The major differences between HDFS-7044 and HDFS-6382 are (please correct me if I'm wrong; I only just learned HDFS-6382 is trying to solve the same problem): # HDFS-6382 is a standalone daemon outside the NameNode, while HDFS-7044 will run inside the NameNode; I believe HDFS-7044 will be simpler and more efficient. # HDFS-7044 allows the user to set a policy based on access time or modify time; HDFS-6382 only supports a single TTL. > Support retention policy based on access time and modify time, use XAttr to > store policy > > > Key: HDFS-7044 > URL: https://issues.apache.org/jira/browse/HDFS-7044 > Project: Hadoop HDFS > Issue Type: New Feature > Components: namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: Retention policy design.pdf > > > The basic idea is to set a retention policy on a directory based on access time and > modify time, and to use an XAttr to store the policy. > Files under a directory which has a retention policy will be deleted if they meet the > retention rule. > There are three rules: > # access time > #* If (accessTime + retentionTimeForAccess < now), the file will be deleted > # modify time > #* If (modifyTime + retentionTimeForModify < now), the file will be deleted > # access time and modify time > #* If (accessTime + retentionTimeForAccess < now && modifyTime + > retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
[ https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong resolved HDFS-7044. Resolution: Duplicate Thanks Allen Wittenauer and Zesheng Wu. After reading the comments in HDFS-6382, I now understand the concerns. > Support retention policy based on access time and modify time, use XAttr to > store policy > > > Key: HDFS-7044 > URL: https://issues.apache.org/jira/browse/HDFS-7044 > Project: Hadoop HDFS > Issue Type: New Feature > Components: namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: Retention policy design.pdf > > > The basic idea is to set a retention policy on a directory based on access time and > modify time, and to use an XAttr to store the policy. > Files under a directory which has a retention policy will be deleted if they meet the > retention rule. > There are three rules: > # access time > #* If (accessTime + retentionTimeForAccess < now), the file will be deleted > # modify time > #* If (modifyTime + retentionTimeForModify < now), the file will be deleted > # access time and modify time > #* If (accessTime + retentionTimeForAccess < now && modifyTime + > retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-3.patch Updated the patch to merge with trunk. {quote} Why we always pass false in below? 1653 new Sender(out).writeBlock(b, accessToken, clientname, targets, 1654 srcNode, stage, 0, 0, 0, 0, blockSender.getChecksum(), 1655 cachingStrategy, false); {quote} This code path happens when the NameNode asks a DataNode to send a block to another DataNode (DatanodeProtocol.DNA_TRANSFER); it's not triggered by a client, so there is no need to pin the block in this case. {quote} We will never copy a block? 925 if (datanode.data.getPinning(block)) 926 String msg = "Not able to copy block " + block.getBlockId() + " " + 927 "to " + peer.getRemoteAddressString() + " because it's pinned "; 928 LOG.info(msg); 929 sendResponse(ERROR, msg); Any thing to help ensure replica count does not rot when this pinning is enabled? {quote} When a block is under-replicated, the NameNode will send a DatanodeProtocol.DNA_TRANSFER command to the DataNode, which is handled by DataTransfer; pinning won't affect that. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, running Balancer will destroy Regionserver's data locality. > If getBlocks could exclude blocks belonging to files which have a specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
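The pinning behavior discussed above has two sides: balancer-style copies refuse pinned replicas, while NameNode-driven DNA_TRANSFER re-replication proceeds regardless, so replica counts do not rot. That policy can be summarized in a small guard; mayMove and the Op enum are illustrative names for this sketch, not Hadoop's actual code:

```java
// Sketch of the pinned-block policy from the HDFS-6133 discussion.
class PinningGuard {
    enum Op {
        BALANCER_COPY,  // copyBlock path used by the Balancer
        DNA_TRANSFER    // NameNode-initiated re-replication
    }

    // Only balancer/client-initiated copies honor pinning. Re-replication
    // ordered by the NameNode must still move the block, otherwise an
    // under-replicated pinned block could never regain its replica count.
    static boolean mayMove(Op op, boolean isPinned) {
        return op == Op.DNA_TRANSFER || !isPinned;
    }
}
```

This matches the quoted patch hunks: the copyBlock path checks getPinning() and responds with ERROR, while the DataTransfer path passes false for the pinning flag and proceeds.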
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133-3.patch) > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133.patch > > > Currently, running Balancer will destroy Regionserver's data locality. > If getBlocks could exclude blocks belonging to files which have a specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-3.patch Updated the patch, merged with trunk. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, running Balancer will destroy Regionserver's data locality. > If getBlocks could exclude blocks belonging to files which have a specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-4.patch Thanks Yongjun Zhang. Updated the patch according to the comments. {quote} The concept of favoredNodes pre-existed before your patch, now your patch defines that as long as favoredNodes is passed, then pinning will be done. So we are changing the prior definition of how favoredNodes are used. Why not add some additional interface to tell that pinning will happen so we have the option not to pin even if favoredNodes is passed? Not necessarily you need to do what I suggested here, but I'd like to understand your thoughts here. {quote} I think most of the time, if you use favoredNodes, you'd like to keep the block on that machine, so to keep things simple I didn't add a new interface. {quote} Do we ever need interface to do unpinning? {quote} We can add unpinning in another issue if there is a use case that needs it. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133-4.patch, HDFS-6133.patch > > > Currently, running Balancer will destroy Regionserver's data locality. > If getBlocks could exclude blocks belonging to files which have a specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck
zhaoyunjiong created HDFS-7429: -- Summary: DomainSocketWatcher.doPoll0 stuck Key: HDFS-7429 URL: https://issues.apache.org/jira/browse/HDFS-7429 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong I found some of our DataNodes will report "exceeds the limit of concurrent xciever"; the limit is 4K. After checking the stacks, I suspect that DomainSocketWatcher.doPoll0 is stuck: {quote} "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition [0x7f558d5d4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000740df9c90> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) -- "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable 
[0x7f558d3d2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition [0x7f558d7d6000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000740df9cb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:745) "Thread-163852" daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable [0x7f55aef6e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method) at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52) at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457) at java.lang.Thre
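The three thread states in the stack traces above (parked on the ReentrantLock, blocked in writeArray0 inside kick, and awaiting a Condition in add) all belong to one lock/kick/await pattern in DomainSocketWatcher.add. The sketch below reproduces that pattern with a stubbed-out kick; every name here is illustrative, and the real Hadoop code differs (there, kick writes a byte to a Unix-domain socketpair, and the watcher thread in doPoll0 is supposed to drain it — if it doesn't, the kick write blocks while the lock is held and every later caller piles up, exhausting the xceiver limit):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the add()/kick()/await pattern seen in the traces.
class WatcherSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition processed = lock.newCondition();
    private boolean pending = false;

    // Corresponds to the frames at DomainSocketWatcher.add:286 (lock),
    // kick:350 (wakeup write), and add:306 (condition await).
    public void add() throws InterruptedException {
        lock.lock();
        try {
            pending = true;
            kick();                    // in Hadoop: a write that can block
            while (pending) {
                processed.await();     // wait for the watcher thread
            }
        } finally {
            lock.unlock();
        }
    }

    // Stub for the wakeup write; in Hadoop this is DomainSocket.writeArray0.
    private void kick() { }

    // Called by the watcher thread once it has handled the wakeup.
    public void ack() {
        lock.lock();
        try {
            pending = false;
            processed.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

With a healthy watcher thread, add() returns promptly after ack(); the reported hang corresponds to kick() never returning, so the lock is never released.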
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Attachment: 11241025 11241023 11241021 Upload more stack trace files. > DomainSocketWatcher.doPoll0 stuck > - > > Key: HDFS-7429 > URL: https://issues.apache.org/jira/browse/HDFS-7429 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: zhaoyunjiong > Attachments: 11241021, 11241023, 11241025 > > > I found some of our DataNodes will run "exceeds the limit of concurrent > xciever", the limit is 4K. > After check the stack, I suspect that DomainSocketWatcher.doPoll0 stuck: > {quote} > "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation > #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition > [0x7f558d5d4000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000740df9c90> (a > java.util.concurrent.locks.ReentrantLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) > at > java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214) > at > java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286) > at > org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) > at > 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > -- > "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation > #1]" daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable > [0x7f558d3d2000] >java.lang.Thread.State: RUNNABLE > at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) > at > org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) > at > org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) > at > org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation > #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition > [0x7f558d7d6000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000740df9cb0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306) > at > 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Summary: DomainSocketWatcher.kick stuck (was: DomainSocketWatcher.doPoll0 stuck) > DomainSocketWatcher.kick stuck > -- > > Key: HDFS-7429 > URL: https://issues.apache.org/jira/browse/HDFS-7429 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: zhaoyunjiong > Attachments: 11241021, 11241023, 11241025 > > > I found some of our DataNodes will run "exceeds the limit of concurrent > xciever", the limit is 4K. > After check the stack, I suspect that DomainSocketWatcher.doPoll0 stuck: > {quote} > "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation > #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition > [0x7f558d5d4000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000740df9c90> (a > java.util.concurrent.locks.ReentrantLock$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) > at > java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214) > at > java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286) > at > org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) > at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > -- > "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation > #1]" daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable > [0x7f558d3d2000] >java.lang.Thread.State: RUNNABLE > at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) > at > org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) > at > org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) > at > org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation > #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition > [0x7f558d7d6000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x000740df9cb0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) > at > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306) > at > 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:745) >
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Description: I found some of our DataNodes will report "exceeds the limit of concurrent xciever"; the limit is 4K. After checking the stacks, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by DomainSocketWatcher.kick, is stuck: {quote} "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition [0x7f558d5d4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000740df9c90> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) -- "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5575000 
nid=0x37b3 runnable [0x7f558d3d2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) "DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition [0x7f558d7d6000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000740df9cb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:745) "Thread-163852" daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable [0x7f55aef6e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method) at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52) at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457) at java.lang.Thread.run(Thread.java:745) {quote} was: I found
[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224249#comment-14224249 ] zhaoyunjiong commented on HDFS-7429: The previous description is not right. The stuck thread happened at org.apache.hadoop.net.unix.DomainSocket.writeArray0 as below shows. {quote} $ grep -B2 -A10 DomainSocket.writeArray 1124102* 11241021-"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241021- java.lang.Thread.State: RUNNABLE 11241021: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241021- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241021- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241021- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241021- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241021- at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241021- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241021- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241021- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241021- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241021- at java.lang.Thread.run(Thread.java:745) -- -- 11241023-"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241023- java.lang.Thread.State: RUNNABLE 11241023: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241023- at 
org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241023- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241023- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241023- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241023- at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241023- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241023- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241023- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241023- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241023- at java.lang.Thread.run(Thread.java:745) -- -- 11241025-"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241025- java.lang.Thread.State: RUNNABLE 11241025: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241025- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241025- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241025- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241025- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241025- at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241025- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241025- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241025- at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241025- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241025- at java.lang.Thread.run(Thread.java:745) {quote} > DomainSocketWatcher.kick stuck > -- > > Key: HDFS-7429 > URL: https://issues.apache.org/jira/browse/HDFS-7429 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: zhaoyunjiong > Attachments: 11241021, 11241023, 11241025 > > > I found some of our DataNodes will run "exceeds the limit of concurrent > xciever", the limit is 4K. > After check the stack, I suspect that > org.apache.hadoop.net.unix.DomainSocket.writeArray0 which called by >
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429:
---
Description:

I found that some of our DataNodes hit "exceeds the limit of concurrent xcievers"; the limit is 4K. After checking the stacks, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0, called by DomainSocketWatcher.kick, is stuck:

{quote}
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition [0x7f558d5d4000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000740df9c90> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
	at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
--
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
	at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
	at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
	at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
	at java.lang.Thread.run(Thread.java:745)
"DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1]" daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition [0x7f558d7d6000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000740df9cb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
	at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
	at java.lang.Thread.run(Thread.java:745)
"Thread-163852" daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable [0x7f55aef6e000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
	at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
	at java.lang.Thread.run(Thread.java:745)
{quote}
[jira] [Assigned] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong reassigned HDFS-7429:
---
Assignee: zhaoyunjiong

> DomainSocketWatcher.kick stuck
> Attachments: 11241021, 11241023, 11241025
[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224325#comment-14224325 ] zhaoyunjiong commented on HDFS-7429:
---
The problem here is that on our machines we can only send 299 bytes to the domain socket; the attempt to send the 300th byte blocks. DomainSocketWatcher.add(DomainSocket sock, Handler handler) holds the lock during that write, so watcherThread.run can't acquire the lock and clear the buffer: it's a live lock. I'm not sure yet which configuration controls the buffer size of 299. For now I suspect net.core.netdev_budget, which is 300 on our machines. I'll upload a patch later that limits the number of bytes sent, to prevent the live lock. By the way, should I move this to the HADOOP COMMON project?

> DomainSocketWatcher.kick stuck
> Attachments: 11241021, 11241023, 11241025
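The failure mode described in the comment above, a thread performing a potentially blocking write while holding the very lock the watcher thread needs before it can drain that pipe, suggests a general fix: never block while holding the shared lock. The following is a minimal hypothetical sketch (class and method names invented, not the actual DomainSocketWatcher code) where the "kick" is a Semaphore release, which cannot stall the way a write to a full pipe can:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: add() hands the socket off through a lock-free queue
// and wakes the watcher with a non-blocking Semaphore release, so the caller
// can never get stuck holding a lock the watcher thread needs.
public class WatcherSketch {
    private final Queue<Integer> toAdd = new ConcurrentLinkedQueue<>();
    private final Semaphore kick = new Semaphore(0); // wake-up signal
    private volatile boolean closed = false;

    public void add(int sock) {
        toAdd.add(sock);  // enqueue without holding any lock
        kick.release();   // non-blocking kick, unlike a bounded pipe write
    }

    public Thread startWatcher(CountDownLatch processed) {
        Thread t = new Thread(() -> {
            try {
                while (!closed) {
                    kick.acquire();            // wait for a kick
                    Integer sock;
                    while ((sock = toAdd.poll()) != null) {
                        processed.countDown(); // stand-in for registering the fd
                    }
                }
            } catch (InterruptedException ignored) { }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        WatcherSketch w = new WatcherSketch();
        CountDownLatch processed = new CountDownLatch(1000);
        w.startWatcher(processed);
        // The adds never block, regardless of any kernel buffer size.
        for (int i = 0; i < 1000; i++) w.add(i);
        processed.await();
        w.closed = true;
        System.out.println("all sockets registered without blocking");
    }
}
```

The actual HDFS patch instead limits how many bytes are written to the socket; this sketch only illustrates the invariant that the kick path must not be able to block.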
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133:
---
Attachment: HDFS-6133-5.patch

Thanks Yongjun Zhang; updated the patch to fix the formatting.

> Make Balancer support exclude specified path
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover, namenode
> Reporter: zhaoyunjiong
> Assignee: zhaoyunjiong
> Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133.patch
>
> Currently, running the Balancer destroys the RegionServer's data locality.
> If getBlocks could exclude blocks belonging to files with a specific path prefix, like "/hbase", then we could run the Balancer without destroying the RegionServer's data locality.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
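The exclusion idea behind HDFS-6133 can be sketched as a simple prefix filter. This is a hypothetical illustration (class and method names invented, not the actual patch): when selecting blocks for the Balancer, skip any block whose owning file falls under an excluded prefix such as "/hbase":

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of path-prefix exclusion for Balancer block selection.
public class ExcludePathFilter {
    private final List<String> excludedPrefixes;

    public ExcludePathFilter(List<String> excludedPrefixes) {
        this.excludedPrefixes = excludedPrefixes;
    }

    /** Returns true if the file may contribute blocks to the Balancer. */
    public boolean isMovable(String filePath) {
        for (String prefix : excludedPrefixes) {
            // Match "/hbase" itself and anything below it, but not "/hbase2".
            if (filePath.equals(prefix) || filePath.startsWith(prefix + "/")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ExcludePathFilter f = new ExcludePathFilter(Arrays.asList("/hbase"));
        List<String> candidates = Arrays.asList(
            "/hbase/data/t1/region/cf/file1", "/hbase2/file", "/user/a/part-0");
        List<String> movable =
            candidates.stream().filter(f::isMovable).collect(Collectors.toList());
        System.out.println(movable); // [/hbase2/file, /user/a/part-0]
    }
}
```

Matching on `prefix + "/"` rather than the bare prefix avoids accidentally excluding sibling paths like "/hbase2".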
[jira] [Created] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile
zhaoyunjiong created HDFS-7470:
---
Summary: SecondaryNameNode need twice memory when calling reloadFromImageFile
Key: HDFS-7470
URL: https://issues.apache.org/jira/browse/HDFS-7470
Project: Hadoop HDFS
Issue Type: Bug
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong

histo information at 2014-12-02 01:19
{quote}
 num     #instances         #bytes  class name
--
   1:     186449630    19326123016  [Ljava.lang.Object;
   2:     157366649    15107198304  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:     183409030    11738177920  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   4:     157358401     5244264024  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   5:             3     3489661000  [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
   6:      29253275     1872719664  [B
   7:       3230821      284312248  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   8:       2756284      110251360  java.util.ArrayList
   9:        469158       22519584  org.apache.hadoop.fs.permission.AclEntry
  10:           847       17133032  [Ljava.util.HashMap$Entry;
  11:        188471       17059632  [C
  12:        314614       10067656  [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
  13:        234579        9383160  com.google.common.collect.RegularImmutableList
  14:         49584        6850280
  15:         49584        6356704
  16:        187270        5992640  java.lang.String
  17:        234579        5629896  org.apache.hadoop.hdfs.server.namenode.AclFeature
{quote}

histo information at 2014-12-02 01:32
{quote}
 num     #instances         #bytes  class name
--
   1:     355838051    35566651032  [Ljava.lang.Object;
   2:     302272758    29018184768  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:     352500723    22560046272  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
   4:     302264510    10075087952  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
   5:     177120233     9374983920  [B
   6:             3     3489661000  [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement;
   7:       6191688      544868544  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
   8:       2799256      111970240  java.util.ArrayList
   9:        890728       42754944  org.apache.hadoop.fs.permission.AclEntry
  10:        330986       29974408  [C
  11:        596871       19099880  [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature;
  12:        445364       17814560  com.google.common.collect.RegularImmutableList
  13:           844       17132816  [Ljava.util.HashMap$Entry;
  14:        445364       10688736  org.apache.hadoop.hdfs.server.namenode.AclFeature
  15:        329789       10553248  java.lang.String
  16:         91741        8807136  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction
  17:         49584        6850280
{quote}

And the stack trace shows it was doing reloadFromImageFile:
{quote}
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562)
	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048)
	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536)
	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388)
	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:354)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:356)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1630)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:350)
	at java.lang.Thread.run(Thread.java:745)
{quote}

So before doing reloadFromImageFile, I thi
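The doubled heap between the two histograms above (both the old and the new namespace live at once during reloadFromImageFile) points at a general pattern, which the eventual patch addresses by re-initializing the namesystem first. A minimal hypothetical sketch (names invented, not the actual SecondaryNameNode code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: if the checkpointer rebuilds its namespace while the
// old tree is still referenced, both copies are live at once and the heap
// must hold roughly 2x the namespace. Dropping the old reference *before*
// loading lets the GC reclaim the old tree while the new one is built.
public class CheckpointReload {
    Map<Long, String> inodeMap = new HashMap<>(); // stand-in for the namespace

    void load(int size) {
        Map<Long, String> fresh = new HashMap<>();
        for (long i = 0; i < size; i++) fresh.put(i, "inode-" + i);
        inodeMap = fresh;
    }

    // Naive reload: the old map stays strongly reachable via the field until
    // load() swaps it, so peak live set is old namespace + new namespace.
    void reloadNaive(int size) {
        load(size);
    }

    // Re-init first, then load: only one copy of the namespace is strongly
    // reachable at any time, so peak live set stays near 1x.
    void reloadAfterReinit(int size) {
        inodeMap = new HashMap<>(); // old tree becomes garbage immediately
        load(size);
    }

    public static void main(String[] args) {
        CheckpointReload c = new CheckpointReload();
        c.load(100_000);
        c.reloadAfterReinit(100_000);
        System.out.println("namespace entries after reload: " + c.inodeMap.size());
    }
}
```

The trade-off, discussed later in this issue, is safety: clearing state up front is only sound if no other thread can observe the half-initialized namesystem.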
[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile
[ https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7470:
---
Attachment: HDFS-7470.patch

This patch re-initializes the namesystem before doing reloadFromImageFile.

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> Attachments: HDFS-7470.patch
[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile
[ https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7470:
---
Status: Patch Available (was: Open)

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> Attachments: HDFS-7470.patch
[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile
[ https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7470:
---
Attachment: HDFS-7470.1.patch

Updated the patch to fix the test failure.

> SecondaryNameNode need twice memory when calling reloadFromImageFile
> Attachments: HDFS-7470.1.patch, HDFS-7470.patch
[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile
[ https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7470: --- Attachment: secondaryNameNode.jstack.txt Thanks Chris Nauroth for your time. Upload a stack trace file for SecondaryNameNode. Correct me if I'm wrong, from stack trace, I think there won't have two threads hold FSNamesystem.writeLock. And SecondaryNameNode didn't start service like BlockManager and CacheManager. For the edit log, SecondaryNameNode won't open it for write. I'll check again whether I missed some risk or try to find out a more safer solution later. > SecondaryNameNode need twice memory when calling reloadFromImageFile > > > Key: HDFS-7470 > URL: https://issues.apache.org/jira/browse/HDFS-7470 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-7470.1.patch, HDFS-7470.patch, > secondaryNameNode.jstack.txt > > > histo information at 2014-12-02 01:19 > {quote} > num #instances #bytes class name > -- >1: 18644963019326123016 [Ljava.lang.Object; >2: 15736664915107198304 > org.apache.hadoop.hdfs.server.namenode.INodeFile >3: 18340903011738177920 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo >4: 157358401 5244264024 > [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; >5: 3 3489661000 > [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement; >6: 29253275 1872719664 [B >7: 3230821 284312248 > org.apache.hadoop.hdfs.server.namenode.INodeDirectory >8: 2756284 110251360 java.util.ArrayList >9:469158 22519584 org.apache.hadoop.fs.permission.AclEntry > 10: 847 17133032 [Ljava.util.HashMap$Entry; > 11:188471 17059632 [C > 12:314614 10067656 > [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature; > 13:2345799383160 > com.google.common.collect.RegularImmutableList > 14: 495846850280 > 15: 495846356704 > 16:1872705992640 java.lang.String > 17:2345795629896 > org.apache.hadoop.hdfs.server.namenode.AclFeature > {quote} > histo information at 
2014-12-02 01:32 > {quote} > num #instances #bytes class name > -- > 1: 355838051 35566651032 [Ljava.lang.Object; > 2: 302272758 29018184768 > org.apache.hadoop.hdfs.server.namenode.INodeFile > 3: 352500723 22560046272 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo > 4: 302264510 10075087952 > [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; > 5: 177120233 9374983920 [B > 6: 3 3489661000 > [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement; > 7: 6191688 544868544 > org.apache.hadoop.hdfs.server.namenode.INodeDirectory > 8: 2799256 111970240 java.util.ArrayList > 9: 890728 42754944 org.apache.hadoop.fs.permission.AclEntry > 10: 330986 29974408 [C > 11: 596871 19099880 > [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature; > 12: 445364 17814560 > com.google.common.collect.RegularImmutableList > 13: 844 17132816 [Ljava.util.HashMap$Entry; > 14: 445364 10688736 > org.apache.hadoop.hdfs.server.namenode.AclFeature > 15: 329789 10553248 java.lang.String > 16: 91741 8807136 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction > 17: 49584 6850280 > {quote} > And the stack trace shows it was doing reloadFromImageFile: > {quote} > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562) > at > 
org.apache.hadoop.hdfs.server.namenode.Seco
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-6.patch Thanks Kai Zheng for pointing out the bug in TestBalancer. It is fixed in the new patch, and the comment mentioned is updated as well. The mover uses replaceBlock the same way the balancer does, so the mover will not be able to move pinned blocks either. For storage type/policy, Kihwal Lee already answered the question: {quote} We will make additional changes in separate jiras to make NN aware of favored nodes. I think the patch in this jira is fine as a stepping stone for the further work. {quote} > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile
[ https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7470: --- Attachment: HDFS-7470.2.patch This patch will clear BlocksMap in FSNamesystem.clear(). I believe this should release the memory. > SecondaryNameNode need twice memory when calling reloadFromImageFile > > > Key: HDFS-7470 > URL: https://issues.apache.org/jira/browse/HDFS-7470 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-7470.1.patch, HDFS-7470.2.patch, HDFS-7470.patch, > secondaryNameNode.jstack.txt > > > histo information at 2014-12-02 01:19 > {quote} > num #instances #bytes class name > -- >1: 18644963019326123016 [Ljava.lang.Object; >2: 15736664915107198304 > org.apache.hadoop.hdfs.server.namenode.INodeFile >3: 18340903011738177920 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo >4: 157358401 5244264024 > [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; >5: 3 3489661000 > [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement; >6: 29253275 1872719664 [B >7: 3230821 284312248 > org.apache.hadoop.hdfs.server.namenode.INodeDirectory >8: 2756284 110251360 java.util.ArrayList >9:469158 22519584 org.apache.hadoop.fs.permission.AclEntry > 10: 847 17133032 [Ljava.util.HashMap$Entry; > 11:188471 17059632 [C > 12:314614 10067656 > [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature; > 13:2345799383160 > com.google.common.collect.RegularImmutableList > 14: 495846850280 > 15: 495846356704 > 16:1872705992640 java.lang.String > 17:2345795629896 > org.apache.hadoop.hdfs.server.namenode.AclFeature > {quote} > histo information at 2014-12-02 01:32 > {quote} > num #instances #bytes class name > -- >1: 35583805135566651032 [Ljava.lang.Object; >2: 30227275829018184768 > org.apache.hadoop.hdfs.server.namenode.INodeFile >3: 35250072322560046272 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo >4: 30226451010075087952 > 
[Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; >5: 177120233 9374983920 [B >6: 3 3489661000 > [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement; >7: 6191688 544868544 > org.apache.hadoop.hdfs.server.namenode.INodeDirectory >8: 2799256 111970240 java.util.ArrayList >9:890728 42754944 org.apache.hadoop.fs.permission.AclEntry > 10:330986 29974408 [C > 11:596871 19099880 > [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature; > 12:445364 17814560 > com.google.common.collect.RegularImmutableList > 13: 844 17132816 [Ljava.util.HashMap$Entry; > 14:445364 10688736 > org.apache.hadoop.hdfs.server.namenode.AclFeature > 15:329789 10553248 java.lang.String > 16: 917418807136 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction > 17: 495846850280 > {quote} > And the stack trace shows it was doing reloadFromImageFile: > {quote} > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388) > at > 
org.apache.hadoop.hdfs.server.namenode.S
[jira] [Commented] (HDFS-7470) SecondaryNameNode need twice memory when calling reloadFromImageFile
[ https://issues.apache.org/jira/browse/HDFS-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276347#comment-14276347 ] zhaoyunjiong commented on HDFS-7470: Chris, thanks for your time. > SecondaryNameNode need twice memory when calling reloadFromImageFile > > > Key: HDFS-7470 > URL: https://issues.apache.org/jira/browse/HDFS-7470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Fix For: 2.7.0 > > Attachments: HDFS-7470.1.patch, HDFS-7470.2.patch, HDFS-7470.patch, > secondaryNameNode.jstack.txt > > > histo information at 2014-12-02 01:19 > {quote} > num #instances #bytes class name > -- >1: 18644963019326123016 [Ljava.lang.Object; >2: 15736664915107198304 > org.apache.hadoop.hdfs.server.namenode.INodeFile >3: 18340903011738177920 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo >4: 157358401 5244264024 > [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; >5: 3 3489661000 > [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement; >6: 29253275 1872719664 [B >7: 3230821 284312248 > org.apache.hadoop.hdfs.server.namenode.INodeDirectory >8: 2756284 110251360 java.util.ArrayList >9:469158 22519584 org.apache.hadoop.fs.permission.AclEntry > 10: 847 17133032 [Ljava.util.HashMap$Entry; > 11:188471 17059632 [C > 12:314614 10067656 > [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature; > 13:2345799383160 > com.google.common.collect.RegularImmutableList > 14: 495846850280 > 15: 495846356704 > 16:1872705992640 java.lang.String > 17:2345795629896 > org.apache.hadoop.hdfs.server.namenode.AclFeature > {quote} > histo information at 2014-12-02 01:32 > {quote} > num #instances #bytes class name > -- >1: 35583805135566651032 [Ljava.lang.Object; >2: 30227275829018184768 > org.apache.hadoop.hdfs.server.namenode.INodeFile >3: 35250072322560046272 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo >4: 30226451010075087952 > 
[Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo; >5: 177120233 9374983920 [B >6: 3 3489661000 > [Lorg.apache.hadoop.util.LightWeightGSet$LinkedElement; >7: 6191688 544868544 > org.apache.hadoop.hdfs.server.namenode.INodeDirectory >8: 2799256 111970240 java.util.ArrayList >9:890728 42754944 org.apache.hadoop.fs.permission.AclEntry > 10:330986 29974408 [C > 11:596871 19099880 > [Lorg.apache.hadoop.hdfs.server.namenode.INode$Feature; > 12:445364 17814560 > com.google.common.collect.RegularImmutableList > 13: 844 17132816 [Ljava.util.HashMap$Entry; > 14:445364 10688736 > org.apache.hadoop.hdfs.server.namenode.AclFeature > 15:329789 10553248 java.lang.String > 16: 917418807136 > org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction > 17: 495846850280 > {quote} > And the stack trace shows it was doing reloadFromImageFile: > {quote} > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.getInode(FSDirectory.java:2426) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:160) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:243) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:121) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:902) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:888) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.reloadFromImageFile(FSImage.java:562) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1048) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:536) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:388) > at > 
org.apache.hadoop.hdfs.s
[jira] [Created] (HDFS-7735) Optimize decommission Datanodes to reduce the impact on NameNode's performance
zhaoyunjiong created HDFS-7735: -- Summary: Optimize decommission Datanodes to reduce the impact on NameNode's performance Key: HDFS-7735 URL: https://issues.apache.org/jira/browse/HDFS-7735 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong When decommissioning DataNodes, the DecommissionManager by default checks progress every 30 seconds, and it holds the Namesystem write lock while doing so, which significantly impacts NameNode performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7735) Optimize decommission Datanodes to reduce the impact on NameNode's performance
[ https://issues.apache.org/jira/browse/HDFS-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7735: --- Status: Patch Available (was: Open) > Optimize decommission Datanodes to reduce the impact on NameNode's performance > -- > > Key: HDFS-7735 > URL: https://issues.apache.org/jira/browse/HDFS-7735 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-7735.patch > > > When decommission DataNodes, by default, DecommissionManager will check > progress every 30 seconds, and it will hold writeLock of Namesystem. It > significantly impact NameNode's performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7735) Optimize decommission Datanodes to reduce the impact on NameNode's performance
[ https://issues.apache.org/jira/browse/HDFS-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7735: --- Attachment: HDFS-7735.patch This patch tries to reduce the impact in two ways: # Adjust dfs.namenode.decommission.interval from 30 to 300 seconds to reduce the check frequency # Remove blocks that already have enough live replicas from decommissioning nodes to reduce the check time > Optimize decommission Datanodes to reduce the impact on NameNode's performance > -- > > Key: HDFS-7735 > URL: https://issues.apache.org/jira/browse/HDFS-7735 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-7735.patch > > > When decommission DataNodes, by default, DecommissionManager will check > progress every 30 seconds, and it will hold writeLock of Namesystem. It > significantly impact NameNode's performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
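For reference, the interval the patch adjusts is the standard dfs.namenode.decommission.interval property. On a Hadoop 2.x cluster, the change described above would correspond to an hdfs-site.xml entry like the following (300 is the value proposed in this patch, not a shipped default):

```xml
<!-- hdfs-site.xml: check decommission progress every 300 seconds
     instead of the default 30, so the DecommissionManager takes the
     Namesystem write lock ten times less often. -->
<property>
  <name>dfs.namenode.decommission.interval</name>
  <value>300</value>
</property>
```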
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-7.patch Thanks Tsz Wo Nicholas Sze. Updated the patch to add targetPinnings, which pins blocks only on the favored datanodes. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133-7.patch) > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-7.patch Thanks Nicholas. Updated the patch to fix the test failures. By the way, I used "dev-support/test-patch.sh" to test my patch yesterday and it did not catch the test failure; it seems my local test environment has some problems. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-8.patch This version should pass the tests. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, > HDFS-6133-8.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-9.patch Thanks Nicholas, Yongjun Zhang and Benoy Antony. Updated the patch to add a configuration to enable/disable the pinning feature. I used a configuration set on the DFSClient side; compared to setting it on the DataNode side, I think it is more flexible. {quote} PBHelper.convert(..) only adds one FALSE when targetPinnings == null. Should we add n FALSEs, where n = targetPinnings.length? {quote} One is enough, because each DataNode in the pipeline adds another one when sending to the next. As for how to implement this feature on Windows, and whether to use the sticky bit or a second file, can we handle that in another jira issue later? > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133-4.patch, HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, > HDFS-6133-8.patch, HDFS-6133-9.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-10.patch Thanks Nicholas. Updated the patch to use dfs.datanode.block-pinning.enabled. {quote} > I guess using existence of a second file is easier to be compatible across > different platforms (though sticky bit has the advantage of storing the info > at the same file). If using a different file, we need to delete it when deleting the block. How about using the execute bit? {quote} I'm not sure whether there are block files created with the execute bit set, so I don't think using the execute bit is safe enough. I prefer Yongjun Zhang's mixed mechanism: {quote} Another option is to use a mixed mechanism, say, if it's on linux, use sticky bit, otherwise, use existence of a second file. This method means lack of consistency between different platforms. Just wanted to throw a thought here. {quote} If you agree with the mixed mechanism, I'll implement it before my vacation starts on Feb 14th. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, > HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
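For context, enabling the pinning switch discussed above would look roughly like the following hdfs-site.xml fragment (a sketch based on the property name in this thread; consult the shipped hdfs-default.xml for the authoritative description and default):

```xml
<!-- Enable pinning of blocks written to favored-node DataNodes so the
     Balancer (and mover) will not relocate them; off by default. -->
<property>
  <name>dfs.datanode.block-pinning.enabled</name>
  <value>true</value>
</property>
```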
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313405#comment-14313405 ] zhaoyunjiong commented on HDFS-6133: You mean the choice of mechanism should not depend on the OS, but only on the configured mechanism? > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, > HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode
zhaoyunjiong created HDFS-7759: -- Summary: Provide existence-of-a-second-file implementation for pinning blocks on Datanode Key: HDFS-7759 URL: https://issues.apache.org/jira/browse/HDFS-7759 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Provide an existence-of-a-second-file implementation for pinning blocks on the Datanode, and let the admin choose the mechanism (sticky bit or existence of a second file) used to pin blocks on favored Datanodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
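The second-file variant proposed here can be illustrated with a minimal sketch: a block is considered pinned when a companion marker file exists next to it. The class name PinSketch and the .pin suffix are hypothetical illustrations, not the actual HDFS-7759 implementation.

```java
import java.io.File;
import java.io.IOException;

/**
 * Sketch of the "existence of a second file" pinning scheme: block
 * blk_N is pinned when a marker file blk_N.pin sits beside it.
 * Names are illustrative only.
 */
public class PinSketch {
    /** Marker file that sits next to the block file. */
    static File markerFor(File blockFile) {
        return new File(blockFile.getParentFile(), blockFile.getName() + ".pin");
    }

    /** Pin a block by creating its marker file. */
    static void pin(File blockFile) throws IOException {
        markerFor(blockFile).createNewFile();
    }

    /** A block is pinned iff its marker exists. */
    static boolean isPinned(File blockFile) {
        return markerFor(blockFile).exists();
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        File blk = new File(dir, "blk_1234");
        blk.createNewFile();
        System.out.println(isPinned(blk)); // false before pinning
        pin(blk);
        System.out.println(isPinned(blk)); // true after pinning
        // As the thread notes, the marker must be deleted with the block.
        markerFor(blk).delete();
        blk.delete();
    }
}
```

One design consequence visible in the sketch is the cost the reviewers raised: unlike the sticky bit, the marker is a separate file, so every block deletion (and upgrade/finalize path) must also clean up the marker.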
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313432#comment-14313432 ] zhaoyunjiong commented on HDFS-6133: Sure. Created HDFS-7759 to track it. Thanks for taking the time to review the patch. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, > HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313832#comment-14313832 ] zhaoyunjiong commented on HDFS-6133: Most users will use 3 replicas, so NumFavoredNodes should be 3 in most cases. I don't think this will cause a big problem. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, HDFS-6133-5.patch, > HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, HDFS-6133-9.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-11.patch Updated the patch to resolve a merge conflict in hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, > HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, > HDFS-6133-9.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317700#comment-14317700 ] zhaoyunjiong commented on HDFS-6133: Sorry, which question? I clicked the URL, but it doesn't show the question clearly. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, datanode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Fix For: 2.7.0 > > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, > HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, > HDFS-6133-9.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319477#comment-14319477 ] zhaoyunjiong commented on HDFS-6133: {quote} Is there use scenario that one want to specify larger number of favoredNodes? {quote} There might be. Say a user sets 1000 favoredNodes and 10 replicas; DFSOutputStream.getPinnings will do 10,000 comparisons in the worst case. That seems not so bad. Do you think we need to optimize the code? > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, datanode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Fix For: 2.7.0 > > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, > HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, > HDFS-6133-9.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
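The worst-case cost discussed above comes from a linear membership scan: marking which pipeline targets are favored takes O(targets × favoredNodes) comparisons. The sketch below illustrates that shape; PinningsSketch and its method are hypothetical stand-ins, not the actual DFSOutputStream code.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of the getPinnings-style check: each pipeline
 * target is linearly searched in the favored-node list, so the cost
 * is targets.size() * favoredNodes.size() comparisons at worst.
 */
public class PinningsSketch {
    static boolean[] getPinnings(List<String> targets, List<String> favoredNodes) {
        boolean[] pinnings = new boolean[targets.size()];
        for (int i = 0; i < targets.size(); i++) {
            // linear scan: up to favoredNodes.size() comparisons per target
            pinnings[i] = favoredNodes.contains(targets.get(i));
        }
        return pinnings;
    }

    public static void main(String[] args) {
        List<String> favored = Arrays.asList("dn1:50010", "dn2:50010", "dn3:50010");
        List<String> targets = Arrays.asList("dn2:50010", "dn9:50010");
        System.out.println(Arrays.toString(getPinnings(targets, favored))); // [true, false]
    }
}
```

Swapping the favored-node list for a HashSet would make each lookup O(1), but as the comment notes, at 10,000 comparisons the linear scan is already cheap.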
[jira] [Updated] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode
[ https://issues.apache.org/jira/browse/HDFS-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7759: --- Attachment: HDFS-7759.patch I'll add more unit tests and check the logic for upgrade and finalizeUpgrade in two weeks. > Provide existence-of-a-second-file implementation for pinning blocks on > Datanode > > > Key: HDFS-7759 > URL: https://issues.apache.org/jira/browse/HDFS-7759 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-7759.patch > > > Provide existence-of-a-second-file implementation for pinning blocks on > Datanode and let admin choosing the mechanism(use sticky bit or > existence-of-a-second-file) to pinning blocks on favored Datanode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339994#comment-14339994 ] zhaoyunjiong commented on HDFS-6133: At first, my idea was very similar to HDFS-4420; the main difference is that I changed getBlocks to exclude blocks belonging to certain paths. https://issues.apache.org/jira/browse/HDFS-6133?focusedCommentId=13943050&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13943050 After Daryn Sharp's comments, it changed to pinning: https://issues.apache.org/jira/browse/HDFS-6133?focusedCommentId=13980504&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13980504 > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, datanode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Fix For: 2.7.0 > > Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, > HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, > HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, > HDFS-6133-9.patch, HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode
[ https://issues.apache.org/jira/browse/HDFS-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7759: --- Status: Patch Available (was: Open) Just checked the upgrade logic; it takes care of all files starting with "blk_". Nicholas, do you have any comments on this patch? > Provide existence-of-a-second-file implementation for pinning blocks on > Datanode > > > Key: HDFS-7759 > URL: https://issues.apache.org/jira/browse/HDFS-7759 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-7759.patch > > > Provide existence-of-a-second-file implementation for pinning blocks on > Datanode and let admin choosing the mechanism(use sticky bit or > existence-of-a-second-file) to pinning blocks on favored Datanode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7759) Provide existence-of-a-second-file implementation for pinning blocks on Datanode
[ https://issues.apache.org/jira/browse/HDFS-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342638#comment-14342638 ] zhaoyunjiong commented on HDFS-7759: The test failures are not related. > Provide existence-of-a-second-file implementation for pinning blocks on > Datanode > > > Key: HDFS-7759 > URL: https://issues.apache.org/jira/browse/HDFS-7759 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-7759.patch > > > Provide existence-of-a-second-file implementation for pinning blocks on > Datanode and let admin choosing the mechanism(use sticky bit or > existence-of-a-second-file) to pinning blocks on favored Datanode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong reopened HDFS-5396: I made a mistake when I resolved this as Not A Problem: for (Iterator it = dirIterator(NameNodeDirType.IMAGE); it.hasNext();) sd = it.next(); returns the last image StorageDirectory, but due to HDFS-5367 it may not contain an fsimage. > FSImage.getFsImageName should check whether fsimage exists > -- > > Key: HDFS-5396 > URL: https://issues.apache.org/jira/browse/HDFS-5396 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.1 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Fix For: 1.3.0 > > Attachments: HDFS-5396-branch-1.2.patch > > > In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to > all IMAGE dir, so we need to check whether fsimage exists before > FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
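The defensive check this issue asks for can be sketched with plain java.io (a hypothetical simplification for illustration; the real FSImage.getFsImageName walks StorageDirectory objects, not raw File lists):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;

public class FsImageLookup {
    // Hypothetical stand-in for FSImage.getFsImageName: instead of blindly
    // returning the last image directory, return the fsimage from the first
    // directory that actually contains one (after HDFS-5367, not every
    // IMAGE dir is guaranteed to hold a saved fsimage).
    static File getFsImageName(List<File> imageDirs) {
        for (File dir : imageDirs) {
            File fsimage = new File(dir, "fsimage");
            if (fsimage.exists()) {
                return fsimage;
            }
        }
        return null; // no directory holds a valid fsimage
    }

    public static void main(String[] args) throws IOException {
        File empty = Files.createTempDirectory("image-empty").toFile();
        File full = Files.createTempDirectory("image-full").toFile();
        new File(full, "fsimage").createNewFile();

        // The unchecked logic would return the last dir regardless of
        // content; this version skips the directory with no fsimage in it.
        File found = getFsImageName(List.of(empty, full));
        System.out.println(found.getParentFile().equals(full)); // true
    }
}
```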
[jira] [Created] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
zhaoyunjiong created HDFS-5944: -- Summary: LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0, 1.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong In our cluster, we encountered an error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write and kept refreshing its lease. Client B deleted /XXX/20140206/04_30/. Client C opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write. Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log. Then the SecondaryNameNode tried to do a checkpoint and failed, because the lease held by Client A was not deleted when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here, when prefix is /XXX/20140206/04_30/ and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srclen) is '_'. The fix is simple; I'll upload a patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
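The off-by-one described above can be reproduced in isolation. This is a hedged stand-in for the LeaseManager check, not the actual class; the startsWith guard is added only to make the sketch self-contained:

```java
public class PrefixMatch {
    static final char SEPARATOR_CHAR = '/';

    // Buggy check: when the prefix itself ends with '/', p.charAt(srclen)
    // inspects the character *after* the separator ('_' in the example),
    // so a path directly under the deleted directory is wrongly rejected.
    static boolean matchesBuggy(String p, String prefix) {
        if (!p.startsWith(prefix)) return false;
        int srclen = prefix.length();
        return p.length() == srclen || p.charAt(srclen) == SEPARATOR_CHAR;
    }

    // Fixed check, following the approach in the patch: if the prefix ends
    // with '/', subtract 1 from srclen so the separator test lands on the
    // '/' itself.
    static boolean matchesFixed(String p, String prefix) {
        if (!p.startsWith(prefix)) return false;
        int srclen = prefix.length();
        if (srclen > 0 && prefix.charAt(srclen - 1) == SEPARATOR_CHAR) {
            srclen--;
        }
        return p.length() == srclen || p.charAt(srclen) == SEPARATOR_CHAR;
    }

    public static void main(String[] args) {
        String p = "/XXX/20140206/04_30/_SUCCESS.slc.log";
        System.out.println(matchesBuggy(p, "/XXX/20140206/04_30/")); // false: the bug
        System.out.println(matchesFixed(p, "/XXX/20140206/04_30/")); // true
        System.out.println(matchesFixed(p, "/XXX/20140206/04_30"));  // true
        System.out.println(matchesFixed(p, "/XXX/20140206/04_3"));   // false: no partial-name match
    }
}
```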
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Description: In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. was: In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is this a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. 
> at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. > Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. > The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. > The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: HDFS-5944.patch HDFS-5944-branch-1.2.patch This patch is very simple: if the prefix ends with '/', subtract 1 from srclen so that p.charAt(srclen) handles the path correctly. > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch > > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. > Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. 
> The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. > The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901171#comment-13901171 ] zhaoyunjiong commented on HDFS-5944: Brandon, thanks for taking the time to review this patch. I don't think users use DFSClient directly, but even through DistributedFileSystem we can still send a path ending with "/" by passing a path like "/a/b/../", because in getPathName, String result = makeAbsolute(file).toUri().getPath() will return "/a/". About the unit test, I'd be happy to add one. I have two questions that need your help: 1. Is it enough to write a unit test just for findLeaseWithPrefixPath? 2. In trunk there is no TestLeaseManager.java; should I add one? > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, > HDFS-5944.test.txt > > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. 
> Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. > The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. > The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
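The "/a/b/../" collapse mentioned in the comment above can be demonstrated with java.net.URI alone (a sketch of the effect; Hadoop's Path/getPathName performs additional validation beyond plain URI normalization):

```java
import java.net.URI;

public class PathNormalize {
    // Collapse "." and ".." segments the way a client-side absolute path is
    // normalized before the path string reaches the NameNode.
    static String normalize(String path) {
        return URI.create(path).normalize().getPath();
    }

    public static void main(String[] args) {
        // A client-supplied "/a/b/../" normalizes to a path with a trailing
        // slash, which is exactly the input findLeaseWithPrefixPath mishandled.
        System.out.println(normalize("/a/b/../")); // "/a/"
        System.out.println(normalize("/a/./b"));   // "/a/b"
    }
}
```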
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: HDFS-5944-branch-1.2.patch HDFS-5944.patch Update patches with unit test. > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944-branch-1.2.patch, > HDFS-5944.patch, HDFS-5944.patch, HDFS-5944.test.txt > > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. > Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. 
> The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. > The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: (was: HDFS-5944-branch-1.2.patch) > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, > HDFS-5944.test.txt > > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. > Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. > The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. > The fix is simple, I'll upload patch later. 
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: (was: HDFS-5944.patch) > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, > HDFS-5944.test.txt > > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. > Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. > The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. > The fix is simple, I'll upload patch later. 
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905361#comment-13905361 ] zhaoyunjiong commented on HDFS-5944: Multiple trailing "/" is impossible. > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, > HDFS-5944.test.txt > > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. > Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. > The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. 
> The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906435#comment-13906435 ] zhaoyunjiong commented on HDFS-5944: Thank you Brandon and Benoy. > LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right > cause SecondaryNameNode failed do checkpoint > - > > Key: HDFS-5944 > URL: https://issues.apache.org/jira/browse/HDFS-5944 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.0, 2.2.0 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, > HDFS-5944.test.txt, HDFS-5944.trunk.patch > > > In our cluster, we encountered error like this: > java.io.IOException: saveLeases found path > /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) > at > org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) > What happened: > Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. > And Client A continue refresh it's lease. > Client B deleted /XXX/20140206/04_30/ > Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write > Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log > Then secondaryNameNode try to do checkpoint and failed due to failed to > delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. 
> The reason is a bug in findLeaseWithPrefixPath: > int srclen = prefix.length(); > if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { > entries.put(entry.getKey(), entry.getValue()); > } > Here when prefix is /XXX/20140206/04_30/, and p is > /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. > The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5396: --- Attachment: HDFS-5396-branch-1.2.patch Update patch. > FSImage.getFsImageName should check whether fsimage exists > -- > > Key: HDFS-5396 > URL: https://issues.apache.org/jira/browse/HDFS-5396 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.1 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Fix For: 1.3.0 > > Attachments: HDFS-5396-branch-1.2.patch, HDFS-5396-branch-1.2.patch > > > In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to > all IMAGE dir, so we need to check whether fsimage exists before > FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5396: --- Status: Patch Available (was: Reopened) > FSImage.getFsImageName should check whether fsimage exists > -- > > Key: HDFS-5396 > URL: https://issues.apache.org/jira/browse/HDFS-5396 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.2.1 >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Fix For: 1.3.0 > > Attachments: HDFS-5396-branch-1.2.patch, HDFS-5396-branch-1.2.patch > > > In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to > all IMAGE dir, so we need to check whether fsimage exists before > FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
zhaoyunjiong created HDFS-6133: -- Summary: Make Balancer support don't move blocks belongs to Hbase Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files with a specific path prefix, like "/hbase", then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
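The getBlocks filtering idea can be sketched as a plain path-prefix test (hypothetical method names; per the later comments on this issue, the committed approach eventually moved to block pinning instead):

```java
import java.util.List;

public class BalancerExclude {
    // True if filePath lives under any excluded prefix such as "/hbase".
    // Appending "/" to the prefix before startsWith avoids false matches
    // like "/hbasebackup" against the prefix "/hbase".
    static boolean isExcluded(String filePath, List<String> excludedPrefixes) {
        for (String prefix : excludedPrefixes) {
            String dirPrefix = prefix.endsWith("/") ? prefix : prefix + "/";
            if (filePath.equals(prefix) || filePath.startsWith(dirPrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> excluded = List.of("/hbase");
        System.out.println(isExcluded("/hbase/data/t1/f", excluded)); // true
        System.out.println(isExcluded("/hbasebackup/f", excluded));   // false
    }
}
```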
[jira] [Updated] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch This patch makes the Balancer support not moving blocks that belong to HBase. > Make Balancer support don't move blocks belongs to Hbase > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)