[jira] [Commented] (HDDS-2287) Move ozone source code to apache/hadoop-ozone from apache/hadoop

2019-10-23 Thread Vinod Kumar Vavilapalli (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957761#comment-16957761
 ] 

Vinod Kumar Vavilapalli commented on HDDS-2287:
---

Is there a concrete proposal on which exact modules move and which don't? 
I scanned the wiki page and couldn't find it.

Color me completely disconnected, but I was a bit surprised to see on 
HADOOP-16654 that HDDS also moves out. When it all started, didn't we talk 
about the possibility of multiple storage systems on top of HDDS, with Ozone 
being one of them? If so, why do we tie them together this way? Shouldn't HDDS 
stay in the 'core'? Or are Ozone & HDDS tightly coupled now?

> Move ozone source code to apache/hadoop-ozone from apache/hadoop
> 
>
> Key: HDDS-2287
> URL: https://issues.apache.org/jira/browse/HDDS-2287
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
>
> *This issue is created so that its assigned number can be used for any technical 
> commits, making it easy to trace each commit back to its root reason...*
>  
> As discussed and voted on the mailing lists, the Apache Hadoop Ozone source code 
> will be removed from Hadoop trunk and stored in a separate repository.
>  
> Original discussion is here:
> [https://lists.apache.org/thread.html/ef01b7def94ba58f746875999e419e10645437423ab9af19b32821e7@%3Chdfs-dev.hadoop.apache.org%3E]
> (It started as a discussion, but as everybody began to vote, it finished 
> with a call for a lazy consensus vote.)
>  
> Technical proposal is shared on the wiki: 
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Ozone+source+tree+split]
>  
> Discussed on the community meeting: 
> [https://cwiki.apache.org/confluence/display/HADOOP/2019-09-30+Meeting+notes]
>  
> Which is shared on the mailing list to get more feedback: 
> [https://lists.apache.org/thread.html/ed608c708ea302675ae5e39636ed73613f47a93c2ddfbd3c9e24dbae@%3Chdfs-dev.hadoop.apache.org%3E]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12917) Fix description errors in testErasureCodingConf.xml

2018-02-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-12917:
---

Committer of this patch, please set the *Fix Version/s* for this JIRA.

> Fix description errors in testErasureCodingConf.xml
> ---
>
> Key: HDFS-12917
> URL: https://issues.apache.org/jira/browse/HDFS-12917
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: chencan
>Assignee: chencan
>Priority: Major
> Attachments: HADOOP-12917.002.patch, HADOOP-12917.patch
>
>
> In testErasureCodingConf.xml, there are two cases whose description may be 
> "getPolicy : get EC policy information at specified path, which have an EC 
> Policy".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13106) Need to exercise all HDFS APIs for EC

2018-02-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13106:
---

Committer of this patch, please set the *Fix Version/s* for this JIRA.

> Need to exercise all HDFS APIs for EC
> -
>
> Key: HDFS-13106
> URL: https://issues.apache.org/jira/browse/HDFS-13106
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: hdfs
>Affects Versions: 3.0.0
>Reporter: Haibo Yan
>Assignee: Haibo Yan
>Priority: Major
> Attachments: HDFS-13106.001.patch, HDFS-13106.002.patch, 
> HDFS-13106.003.patch
>
>
> Exercise the FileSystem API to make sure all APIs work as expected with the Erasure 
> Coding feature enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12713) [READ] Refactor FileRegion and BlockAliasMap to separate out HDFS metadata and PROVIDED storage metadata

2017-12-18 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295118#comment-16295118
 ] 

Vinod Kumar Vavilapalli commented on HDFS-12713:


[~virajith], a couple of things:
 - Major: The JIRA is missing fix-version, please set it. In general, you 
should set the fix-version at commit time.
 - Minor: Once you've reviewed the patch, you can set the Reviewed flag, an 
option that pops up when you are resolving the JIRA.

> [READ] Refactor FileRegion and BlockAliasMap to separate out HDFS metadata 
> and PROVIDED storage metadata
> 
>
> Key: HDFS-12713
> URL: https://issues.apache.org/jira/browse/HDFS-12713
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Ewan Higgs
> Attachments: HDFS-12713-HDFS-9806.001.patch, 
> HDFS-12713-HDFS-9806.002.patch, HDFS-12713-HDFS-9806.003.patch, 
> HDFS-12713-HDFS-9806.004.patch, HDFS-12713-HDFS-9806.005.patch, 
> HDFS-12713-HDFS-9806.006.patch, HDFS-12713-HDFS-9806.007.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12591) [READ] Implement LevelDBFileRegionFormat

2017-12-18 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295123#comment-16295123
 ] 

Vinod Kumar Vavilapalli commented on HDFS-12591:


[~virajith], the JIRA is missing fix-version, please set it.

> [READ] Implement LevelDBFileRegionFormat
> 
>
> Key: HDFS-12591
> URL: https://issues.apache.org/jira/browse/HDFS-12591
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Ewan Higgs
>Assignee: Ewan Higgs
>Priority: Minor
> Attachments: HDFS-12591-HDFS-9806.001.patch, 
> HDFS-12591-HDFS-9806.002.patch, HDFS-12591-HDFS-9806.003.patch, 
> HDFS-12591-HDFS-9806.004.patch, HDFS-12591-HDFS-9806.005.patch, 
> HDFS-12591-HDFS-9806.006.patch, HDFS-12591-HDFS-9806.007.patch
>
>
> The existing work for HDFS-9806 uses an implementation of {{FileRegion}} that is 
> read from a CSV file. This is good for testability and diagnostic purposes, 
> but it is not very efficient for larger systems.
> There should be a version that is similar to the {{TextFileRegionFormat}} 
> that instead uses LevelDB.
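To make the idea concrete, here is a minimal, hypothetical sketch of a LevelDB-backed region store using the org.iq80.leveldb Java binding; the class name, value encoding, and method names are illustrative assumptions and do not reflect the actual {{LevelDBFileRegionFormat}} implementation.

{code:java}
// Hypothetical sketch only: a LevelDB-backed store of block-id -> region info,
// in the spirit of replacing the CSV-backed format described above.
import java.io.File;
import java.io.IOException;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import static org.iq80.leveldb.impl.Iq80DBFactory.factory;
import static org.iq80.leveldb.impl.Iq80DBFactory.bytes;
import static org.iq80.leveldb.impl.Iq80DBFactory.asString;

public class LevelDbRegionStoreSketch implements AutoCloseable {
  private final DB db;

  public LevelDbRegionStoreSketch(File dir) throws IOException {
    // Open (or create) the LevelDB directory that holds the region records.
    db = factory.open(dir, new Options().createIfMissing(true));
  }

  public void put(long blockId, String path, long offset, long length) {
    // The value is a simple "path,offset,length" string purely for illustration.
    db.put(bytes(Long.toString(blockId)),
           bytes(path + "," + offset + "," + length));
  }

  public String get(long blockId) {
    byte[] value = db.get(bytes(Long.toString(blockId)));
    return value == null ? null : asString(value);
  }

  @Override
  public void close() throws IOException {
    db.close();
  }
}
{code}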



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12665) [AliasMap] Create a version of the AliasMap that runs in memory in the Namenode (leveldb)

2017-12-18 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295125#comment-16295125
 ] 

Vinod Kumar Vavilapalli commented on HDFS-12665:


[~virajith], the JIRA is missing fix-version, please set it.

> [AliasMap] Create a version of the AliasMap that runs in memory in the 
> Namenode (leveldb)
> -
>
> Key: HDFS-12665
> URL: https://issues.apache.org/jira/browse/HDFS-12665
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ewan Higgs
>Assignee: Ewan Higgs
> Attachments: HDFS-12665-HDFS-9806.001.patch, 
> HDFS-12665-HDFS-9806.002.patch, HDFS-12665-HDFS-9806.003.patch, 
> HDFS-12665-HDFS-9806.004.patch, HDFS-12665-HDFS-9806.005.patch, 
> HDFS-12665-HDFS-9806.006.patch, HDFS-12665-HDFS-9806.007.patch, 
> HDFS-12665-HDFS-9806.008.patch, HDFS-12665-HDFS-9806.009.patch, 
> HDFS-12665-HDFS-9806.010.patch, HDFS-12665-HDFS-9806.011.patch, 
> HDFS-12665-HDFS-9806.012.patch
>
>
> The design of Provided Storage requires the use of an AliasMap to manage the 
> mapping between blocks of files on the local HDFS and ranges of files on a 
> remote storage system. To reduce load from the Namenode, this can be done 
> using a pluggable external service (e.g. AzureTable, Cassandra, Ratis). 
> However, to aid adoption and ease of deployment, we propose an in-memory 
> version.
> This AliasMap will be a wrapper around LevelDB (already a dependency from the 
> Timeline Service) and use protobuf for the key (blockpool, blockid, and 
> genstamp) and the value (url, offset, length, nonce). The in-memory service 
> will also have a configurable port on which it will listen for updates from 
> Storage Policy Satisfier (SPS) Coordinating Datanodes (C-DN).
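As a rough illustration of the key/value layout described above: the real implementation uses protobuf messages, so the plain Java classes below are only a hedged sketch with illustrative names.

{code:java}
// Illustrative only: the key (blockpool, blockid, genstamp) and value
// (url, offset, length, nonce) shapes described above, as plain Java classes.
final class BlockAliasKey {
  final String blockPoolId;
  final long blockId;
  final long genStamp;

  BlockAliasKey(String blockPoolId, long blockId, long genStamp) {
    this.blockPoolId = blockPoolId;
    this.blockId = blockId;
    this.genStamp = genStamp;
  }
}

final class ProvidedBlockAlias {
  final String url;    // location of the remote file
  final long offset;   // start of the block's data within that file
  final long length;   // number of bytes belonging to the block
  final byte[] nonce;  // opaque verification token

  ProvidedBlockAlias(String url, long offset, long length, byte[] nonce) {
    this.url = url;
    this.offset = offset;
    this.length = length;
    this.nonce = nonce;
  }
}
{code}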



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-861) TestNodeManager unit tests are broken

2018-11-21 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694967#comment-16694967
 ] 

Vinod Kumar Vavilapalli commented on HDDS-861:
--

While you are at it, can you rename the test to be {{TestSCMNodeManager}}? We 
have a {{TestNodeManager}} on the YARN side :)

> TestNodeManager unit tests are broken
> -
>
> Key: HDDS-861
> URL: https://issues.apache.org/jira/browse/HDDS-861
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.4.0
>Reporter: Shashikant Banerjee
>Priority: Major
> Fix For: 0.4.0
>
>
> Many of the tests are failing with NullPointerException
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager.updateNodeStat(SCMNodeManager.java:195)
> at 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager.register(SCMNodeManager.java:276)
> at 
> org.apache.hadoop.hdds.scm.TestUtils.createRandomDatanodeAndRegister(TestUtils.java:147)
> at 
> org.apache.hadoop.hdds.scm.node.TestNodeManager.testScmHeartbeat(TestNodeManager.java:152)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
> at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
> at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
> at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
> at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13547) Add ingress port based sasl resolver

2018-11-29 Thread Vinod Kumar Vavilapalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13547:
---
Fix Version/s: (was: 3.1.1)
   3.2.0

I just checked the branches; this never made it to 3.1.1 even though the 
fix-version says so.

It's only in branch-3.2, branch-3.2.0 and trunk.

Release-notes for 3.1.1 (which is already released) are broken, but it is what 
it is.

Editing the fix-version.

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.
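For readers unfamiliar with the change, the shape of a port-aware resolver might look roughly like the hedged sketch below; the interface, class, and method names are illustrative assumptions and not the actual Hadoop API.

{code:java}
// Illustrative sketch only (not the real Hadoop interface): a resolver that
// can return different SASL properties depending on the port a client used.
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

interface IngressPortAwareSaslResolver {
  Map<String, String> getServerProperties(InetAddress clientAddress, int ingressPort);
}

final class QopByPortResolver implements IngressPortAwareSaslResolver {
  @Override
  public Map<String, String> getServerProperties(InetAddress clientAddress,
                                                 int ingressPort) {
    Map<String, String> props = new HashMap<>();
    // Hypothetical policy: require privacy on one port, allow auth-only elsewhere.
    props.put("javax.security.sasl.qop", ingressPort == 8021 ? "auth-conf" : "auth");
    return props;
  }
}
{code}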



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-13547) Add ingress port based sasl resolver

2018-11-29 Thread Vinod Kumar Vavilapalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened HDFS-13547:


Oh, and this never made it to branch-3 either. [~vagarychen], I am reopening 
this, please put this in branch-3 too.

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13547) Add ingress port based sasl resolver

2018-11-30 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705271#comment-16705271
 ] 

Vinod Kumar Vavilapalli commented on HDFS-13547:


[~vagarychen], 3.1.1 was already released a while ago, so this shouldn't land in 
branch-3.1.1, nor should this JIRA have the fix-version 3.1.1.

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13547) Add ingress port based sasl resolver

2018-11-30 Thread Vinod Kumar Vavilapalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13547:
---
Fix Version/s: (was: 3.1.1)

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2018-05-29 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494368#comment-16494368
 ] 

Vinod Kumar Vavilapalli commented on HDFS-13596:


[~hanishakoneru], how does this relate to HDFS-11096? Can you leave a comment 
on that JIRA? Thanks!

> NN restart fails after RollingUpgrade from 2.x to 3.x
> -
>
> Key: HDFS-13596
> URL: https://issues.apache.org/jira/browse/HDFS-13596
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Hanisha Koneru
>Priority: Critical
>
> After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails 
> while replaying edit logs.
>  * After NN is started with rollingUpgrade, the layoutVersion written to 
> editLogs (before finalizing the upgrade) is the pre-upgrade layout version 
> (so as to support downgrade).
>  * When writing transactions to log, NN writes as per the current layout 
> version. In 3.x, erasureCoding bits are added to the editLog transactions.
>  * So any edit log written after the upgrade and before finalizing the 
> upgrade will have the old layout version but the new format of transactions.
>  * When NN is restarted and the edit logs are replayed, the NN reads the old 
> layout version from the editLog file. When parsing the transactions, it 
> assumes that the transactions are also from the previous layout and hence 
> skips parsing the erasureCoding bits.
>  * This cascades into reading the wrong set of bits for other fields and 
> leads to NN shutting down.
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86)
>  at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163)
>  at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
> loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less 
> than the current value (=16389), where newValue=16388
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910)
>  at 
> org

[jira] [Reopened] (HDFS-11190) [READ] Namenode support for data stored in external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened HDFS-11190:


Resolving as Fixed instead of as Resolved per our conventions.

> [READ] Namenode support for data stored in external stores.
> ---
>
> Key: HDFS-11190
> URL: https://issues.apache.org/jira/browse/HDFS-11190
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-11190-HDFS-9806.001.patch, 
> HDFS-11190-HDFS-9806.002.patch, HDFS-11190-HDFS-9806.003.patch, 
> HDFS-11190-HDFS-9806.004.patch
>
>
> The goal of this JIRA is to enable the Namenode to know about blocks that are 
> in {{PROVIDED}} stores and are not necessarily stored on any Datanodes. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-11190) [READ] Namenode support for data stored in external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved HDFS-11190.

Resolution: Fixed

> [READ] Namenode support for data stored in external stores.
> ---
>
> Key: HDFS-11190
> URL: https://issues.apache.org/jira/browse/HDFS-11190
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-11190-HDFS-9806.001.patch, 
> HDFS-11190-HDFS-9806.002.patch, HDFS-11190-HDFS-9806.003.patch, 
> HDFS-11190-HDFS-9806.004.patch
>
>
> The goal of this JIRA is to enable the Namenode to know about blocks that are 
> in {{PROVIDED}} stores and are not necessarily stored on any Datanodes. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-10675) [READ] Datanode support to read from external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened HDFS-10675:


Resolving as Fixed instead of as Resolved per our conventions.

> [READ] Datanode support to read from external stores.
> -
>
> Key: HDFS-10675
> URL: https://issues.apache.org/jira/browse/HDFS-10675
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-10675-HDFS-9806.001.patch, 
> HDFS-10675-HDFS-9806.002.patch, HDFS-10675-HDFS-9806.003.patch, 
> HDFS-10675-HDFS-9806.004.patch, HDFS-10675-HDFS-9806.005.patch, 
> HDFS-10675-HDFS-9806.006.patch, HDFS-10675-HDFS-9806.007.patch, 
> HDFS-10675-HDFS-9806.008.patch, HDFS-10675-HDFS-9806.009.patch
>
>
> This JIRA introduces a new {{PROVIDED}} {{StorageType}} to represent external 
> stores, along with enabling the Datanode to read from such stores using a 
> {{ProvidedReplica}} and a {{ProvidedVolume}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-10675) [READ] Datanode support to read from external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved HDFS-10675.

Resolution: Fixed

> [READ] Datanode support to read from external stores.
> -
>
> Key: HDFS-10675
> URL: https://issues.apache.org/jira/browse/HDFS-10675
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-10675-HDFS-9806.001.patch, 
> HDFS-10675-HDFS-9806.002.patch, HDFS-10675-HDFS-9806.003.patch, 
> HDFS-10675-HDFS-9806.004.patch, HDFS-10675-HDFS-9806.005.patch, 
> HDFS-10675-HDFS-9806.006.patch, HDFS-10675-HDFS-9806.007.patch, 
> HDFS-10675-HDFS-9806.008.patch, HDFS-10675-HDFS-9806.009.patch
>
>
> This JIRA introduces a new {{PROVIDED}} {{StorageType}} to represent external 
> stores, along with enabling the Datanode to read from such stores using a 
> {{ProvidedReplica}} and a {{ProvidedVolume}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13218) Log audit event only used last EC policy name when add multiple policies from file

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13218:
---
Fix Version/s: (was: 3.0.1)
   (was: 3.1.0)

Removing 3.1.0 fix-version from all JIRAs which are Invalid / Won't Fix / 
Duplicate / Cannot Reproduce.

> Log audit event only used last EC policy name when add multiple policies from 
> file 
> ---
>
> Key: HDFS-13218
> URL: https://issues.apache.org/jira/browse/HDFS-13218
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.1.0
>Reporter: liaoyuxiangqin
>Priority: Major
>
> When I read addErasureCodingPolicies() in the FSNamesystem class of the NameNode, 
> I found that the following code only uses the last EC policy name for logAuditEvent. 
> I think this audit log can't track all of the policies when multiple erasure 
> coding policies are added to the ErasureCodingPolicyManager. Thanks.
> {code:java|title=FSNamesystem.java|borderStyle=solid}
> try {
>   checkOperation(OperationCategory.WRITE);
>   checkNameNodeSafeMode("Cannot add erasure coding policy");
>   for (ErasureCodingPolicy policy : policies) {
> try {
>   ErasureCodingPolicy newPolicy =
>   FSDirErasureCodingOp.addErasureCodingPolicy(this, policy,
>   logRetryCache);
>   addECPolicyName = newPolicy.getName();
>   responses.add(new AddErasureCodingPolicyResponse(newPolicy));
> } catch (HadoopIllegalArgumentException e) {
>   responses.add(new AddErasureCodingPolicyResponse(policy, e));
> }
>   }
>   success = true;
>   return responses.toArray(new AddErasureCodingPolicyResponse[0]);
> } finally {
>   writeUnlock(operationName);
>   if (success) {
> getEditLog().logSync();
>   }
>   logAuditEvent(success, operationName,addECPolicyName, null, null);
> }
> {code}
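One way the audit event could carry every added policy name is sketched below. This is only a hedged fragment adapted from the quoted loop (java.util.List/ArrayList assumed imported); it is not the change that was actually committed in Hadoop.

{code:java}
// Hypothetical sketch, reusing identifiers from the snippet above: collect
// every successfully added policy name instead of keeping only the last one.
List<String> addedPolicyNames = new ArrayList<>();
for (ErasureCodingPolicy policy : policies) {
  try {
    ErasureCodingPolicy newPolicy =
        FSDirErasureCodingOp.addErasureCodingPolicy(this, policy, logRetryCache);
    addedPolicyNames.add(newPolicy.getName());
    responses.add(new AddErasureCodingPolicyResponse(newPolicy));
  } catch (HadoopIllegalArgumentException e) {
    responses.add(new AddErasureCodingPolicyResponse(policy, e));
  }
}
// ... and in the finally block, log all of the names, not just the last one:
logAuditEvent(success, operationName, String.join(",", addedPolicyNames), null, null);
{code}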



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-4353) Encapsulate connections to peers in Peer and PeerServer classes

2013-01-03 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543591#comment-13543591
 ] 

Vinod Kumar Vavilapalli commented on HDFS-4353:
---


Send an email to hdfs-dev-unsubscr...@hadoop.apache.org .

Thanks,
+Vinod





> Encapsulate connections to peers in Peer and PeerServer classes
> ---
>
> Key: HDFS-4353
> URL: https://issues.apache.org/jira/browse/HDFS-4353
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, hdfs-client
>Affects Versions: 2.0.3-alpha
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
> Attachments: 02b-cumulative.patch, 02-cumulative.patch
>
>
> Encapsulate connections to peers into the {{Peer}} and {{PeerServer}} 
> classes.  Since many Java classes may be involved with these connections, it 
> makes sense to create a container for them.  For example, a connection to a 
> peer may have an input stream, output stream, ReadableByteChannel, encrypted 
> output stream, and encrypted input stream associated with it.
> This makes us less dependent on the {{NetUtils}} methods which use 
> {{instanceof}} to manipulate socket and stream states based on the runtime 
> type.  It also paves the way to introduce UNIX domain sockets which don't 
> inherit from {{java.net.Socket}}.
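As a rough illustration of the encapsulation being proposed, here is a hedged sketch with illustrative method names; these are not the actual interfaces that were committed.

{code:java}
// Illustrative only: a Peer bundles the per-connection resources instead of
// callers juggling raw sockets and streams, and a PeerServer hands out Peers.
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.ReadableByteChannel;

interface Peer extends Closeable {
  InputStream getInputStream() throws IOException;
  OutputStream getOutputStream() throws IOException;
  ReadableByteChannel getInputStreamChannel();
  void setReadTimeout(int timeoutMs) throws IOException;
}

interface PeerServer extends Closeable {
  Peer accept() throws IOException;  // one accepted connection = one Peer
}
{code}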

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-2229) Deadlock in NameNode

2011-08-05 Thread Vinod Kumar Vavilapalli (JIRA)
Deadlock in NameNode


 Key: HDFS-2229
 URL: https://issues.apache.org/jira/browse/HDFS-2229
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli


Either I am doing something incredibly stupid, or something about my 
environment is completely weird, or maybe it really is a valid bug. I am 
running into a NameNode deadlock consistently with 0.23 HDFS. I could never start the 
NN successfully.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2229) Deadlock in NameNode

2011-08-05 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079910#comment-13079910
 ] 

Vinod Kumar Vavilapalli commented on HDFS-2229:
---

Thread dump:

{quote}
Found one Java-level deadlock:
=
"Thread[Thread-10,5,main]":
  waiting to lock monitor 0x08e19b8c (object 0x31b0b7f0, a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
  which is held by "main"
"main":
  waiting for ownable synchronizer 0x31641a50, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),
  which is held by "Thread[Thread-10,5,main]"

Java stack information for the threads listed above:
===
"Thread[Thread-10,5,main]":
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.isOn(FSNamesystem.java:3183)
- waiting to lock <0x31b0b7f0> (a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.isInSafeMode(FSNamesystem.java:3563)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logUpdateMasterKey(FSNamesystem.java:4523)
at 
org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logUpdateMasterKey(DelegationTokenSecretManager.java:279)
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey(AbstractDelegationTokenSecretManager.java:144)
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.rollMasterKey(AbstractDelegationTokenSecretManager.java:168)
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager$ExpiredTokenRemover.run(AbstractDelegationTokenSecretManager.java:373)
at java.lang.Thread.run(Thread.java:619)
"main":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x31641a50> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeLock(FSNamesystem.java:382)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatedBlocks(BlockManager.java:1743)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.initializeReplQueues(FSNamesystem.java:3257)
- locked <0x31b0b7f0> (a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.leave(FSNamesystem.java:3228)
- locked <0x31b0b7f0> (a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.checkMode(FSNamesystem.java:3315)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.setBlockTotal(FSNamesystem.java:3342)
- locked <0x31b0b7f0> (a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setBlockTotal(FSNamesystem.java:3619)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.activate(FSNamesystem.java:322)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.activate(NameNode.java:489)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:452)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:561)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:553)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1538)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1578)

Found 1 deadlock.
{quote}

Logs show that the RPC server is started but not the HTTP server:
{quote}
...
2011-08-05 08:54:35,286 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Number of files = 1
2011-08-05 08:54:35,286 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Number of files under construction = 0
2011-08-05 08:54:35,287 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Image file of size 113 loaded in 0 seconds.
2011-08-05 08:54:35,287 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Loaded image for txid 0 from 
/tmp/hdfs/hadoop/var/hdfs/name/current/fsimage_000
2011-08-05 08:54:35,288 INFO org.apache.hadoop.hdfs.server.name

[jira] [Commented] (HDFS-2229) Deadlock in NameNode

2011-08-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081428#comment-13081428
 ] 

Vinod Kumar Vavilapalli commented on HDFS-2229:
---

bq. Any idea why the MR cluster launches are triggering this issue reliably but 
the HDFS tests are passing fine? (when they launch hundreds of miniclusters)
My setup is trunk code with security enabled on a real cluster (not single 
node).

I haven't had time to run it again with the attached patch. Will report soon 
once I get around to it.

> Deadlock in NameNode
> 
>
> Key: HDFS-2229
> URL: https://issues.apache.org/jira/browse/HDFS-2229
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs client
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Priority: Blocker
> Attachments: cycle1.png, cycle2.png, h2229_20110807.patch
>
>
> Either I am doing something incredibly stupid, or something about my 
> environment is completely weird, or maybe it really is a valid bug. I am 
> running into a NameNode deadlock consistently with 0.23 HDFS. I could never start 
> the NN successfully.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2197) Refactor RPC call implementations out of NameNode class

2011-09-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097816#comment-13097816
 ] 

Vinod Kumar Vavilapalli commented on HDFS-2197:
---

Todd, I just ran into MAPREDUCE-2935, can you please take a look? Thanks!

Oh, BTW, this ticket isn't resolved yet, though the patch went in.

> Refactor RPC call implementations out of NameNode class
> ---
>
> Key: HDFS-2197
> URL: https://issues.apache.org/jira/browse/HDFS-2197
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: hdfs-2197-2.txt, hdfs-2197-3.txt, hdfs-2197-4.txt, 
> hdfs-2197-trunk-final.txt, hdfs-2197.txt
>
>
> For HA, the NameNode will gain a bit of a state machine, to be able to 
> transition between standby and active states. This would be cleaner in the 
> code if the {{NameNode}} class were just a container for various services, as 
> discussed in HDFS-1974. It's also nice for testing, where it would become 
> easier to construct just the RPC handlers around a mock NameSystem, with no 
> HTTP server, for example.
> This JIRA is to move all of the protocol implementations out of {{NameNode}} 
> into a separate {{NameNodeRPCServer}} class.
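The "container of services" shape being described might look roughly like the hedged sketch below; the names are illustrative, not the actual Hadoop classes.

{code:java}
// Illustrative only: the NameNode as a thin container, with the RPC handlers
// living in their own class so tests can build them around a mock namesystem,
// with no HTTP server involved.
interface Namesystem {
  long getBlockCount();
}

final class RpcServer {
  private final Namesystem namesystem;

  RpcServer(Namesystem namesystem) {
    this.namesystem = namesystem;  // tests can pass a mock here
  }

  long getBlockCount() {
    return namesystem.getBlockCount();
  }
}

final class NameNodeContainer {
  private final RpcServer rpcServer;

  NameNodeContainer(Namesystem namesystem) {
    this.rpcServer = new RpcServer(namesystem);
  }

  RpcServer getRpcServer() {
    return rpcServer;
  }
}
{code}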

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-1620) Rename HdfsConstants -> HdfsServerConstants, FSConstants -> HdfsConstants

2011-09-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097875#comment-13097875
 ] 

Vinod Kumar Vavilapalli commented on HDFS-1620:
---

I ran into MAPREDUCE-2936 because of this patch.

Why isn't this issue being fixed in a backwards-compatible way, as Harsh's [first 
comment on this 
ticket|https://issues.apache.org/jira/browse/HDFS-1620?focusedCommentId=13030371&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13030371]
 outlined?

> Rename HdfsConstants -> HdfsServerConstants, FSConstants -> HdfsConstants
> -
>
> Key: HDFS-1620
> URL: https://issues.apache.org/jira/browse/HDFS-1620
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 0.24.0
>Reporter: Tsz Wo (Nicholas), SZE
>Assignee: Harsh J
>Priority: Minor
> Fix For: 0.24.0
>
> Attachments: HDFS-1620.r1.diff, HDFS-1620.r2.diff, HDFS-1620.r3.diff, 
> HDFS-1620.r4.diff
>
>
> We have {{org.apache.hadoop.hdfs.protocol.*FSConstants*}} and 
> {{org.apache.hadoop.fs.*FsConstants*}}.  Elegant or confused?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-1620) Rename HdfsConstants -> HdfsServerConstants, FSConstants -> HdfsConstants

2011-09-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097895#comment-13097895
 ] 

Vinod Kumar Vavilapalli commented on HDFS-1620:
---

bq. The intended audience was private, and hence just HDFS components need to 
use it (with RAID needing to exist in MR, being the only exception).
Alright, that settles it. I will go ahead and fix RAID. Thanks for the quick 
reply.

> Rename HdfsConstants -> HdfsServerConstants, FSConstants -> HdfsConstants
> -
>
> Key: HDFS-1620
> URL: https://issues.apache.org/jira/browse/HDFS-1620
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 0.24.0
>Reporter: Tsz Wo (Nicholas), SZE
>Assignee: Harsh J
>Priority: Minor
> Fix For: 0.24.0
>
> Attachments: HDFS-1620.r1.diff, HDFS-1620.r2.diff, HDFS-1620.r3.diff, 
> HDFS-1620.r4.diff
>
>
> We have {{org.apache.hadoop.hdfs.protocol.*FSConstants*}} and 
> {{org.apache.hadoop.fs.*FsConstants*}}.  Elegant or confused?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-2314) MRV1 test compilation broken after HDFS-2197

2011-09-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098191#comment-13098191
 ] 

Vinod Kumar Vavilapalli commented on HDFS-2314:
---

Can confirm that test compilation passes with the attached patch.

The patch looks good too. +1. Couldn't think of that simple a fix ;)

> MRV1 test compilation broken after HDFS-2197
> 
>
> Key: HDFS-2314
> URL: https://issues.apache.org/jira/browse/HDFS-2314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Todd Lipcon
> Attachments: hdfs-2314.txt
>
>
> Running the following:
>   At the trunk level: {{mvn clean install package -Dtar -Pdist 
> -Dmaven.test.skip.exec=true}}
>   In hadoop-mapreduce-project: {{ant jar-test -Dresolvers=internal}}
> yields the errors:
> {code}
> [javac] 
> /home/vinodkv/Workspace/eclipse-workspace/apache-git/hadoop-common/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/security/authorize/TestServiceLevelAuthorization.java:62:
>  cannot find symbol
> [javac] symbol  : method 
> getRpcServer(org.apache.hadoop.hdfs.server.namenode.NameNode)
> [javac] location: class 
> org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter
> [javac]   Set> protocolsWithAcls = 
> NameNodeAdapter.getRpcServer(dfs.getNameNode())
> [javac]^
> [javac] 
> /home/vinodkv/Workspace/eclipse-workspace/apache-git/hadoop-common/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/security/authorize/TestServiceLevelAuthorization.java:79:
>  cannot find symbol
> [javac] symbol  : method 
> getRpcServer(org.apache.hadoop.hdfs.server.namenode.NameNode)
> [javac] location: class 
> org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter
> [javac]   protocolsWithAcls = 
> NameNodeAdapter.getRpcServer(dfs.getNameNode())
> [javac] 2 errors
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HDFS-2314) MRV1 test compilation broken after HDFS-2197

2011-09-06 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-2314:
--

Fix Version/s: 0.24.0
   0.23.0
 Hadoop Flags: [Reviewed]

Please push this to trunk *and* 0.23 once Jenkins is done with all his 
machinations. Thanks!

> MRV1 test compilation broken after HDFS-2197
> 
>
> Key: HDFS-2314
> URL: https://issues.apache.org/jira/browse/HDFS-2314
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Todd Lipcon
> Fix For: 0.23.0, 0.24.0
>
> Attachments: hdfs-2314.txt
>
>
> Running the following:
>   At the trunk level: {{mvn clean install package -Dtar -Pdist 
> -Dmaven.test.skip.exec=true}}
>   In hadoop-mapreduce-project: {{ant jar-test -Dresolvers=internal}}
> yields the errors:
> {code}
> [javac] 
> /home/vinodkv/Workspace/eclipse-workspace/apache-git/hadoop-common/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/security/authorize/TestServiceLevelAuthorization.java:62:
>  cannot find symbol
> [javac] symbol  : method 
> getRpcServer(org.apache.hadoop.hdfs.server.namenode.NameNode)
> [javac] location: class 
> org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter
> [javac]   Set> protocolsWithAcls = 
> NameNodeAdapter.getRpcServer(dfs.getNameNode())
> [javac]^
> [javac] 
> /home/vinodkv/Workspace/eclipse-workspace/apache-git/hadoop-common/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/security/authorize/TestServiceLevelAuthorization.java:79:
>  cannot find symbol
> [javac] symbol  : method 
> getRpcServer(org.apache.hadoop.hdfs.server.namenode.NameNode)
> [javac] location: class 
> org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter
> [javac]   protocolsWithAcls = 
> NameNodeAdapter.getRpcServer(dfs.getNameNode())
> [javac] 2 errors
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-8498) Blocks can be committed with wrong size

2016-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219057#comment-15219057
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8498:
---

[~zhz] / [~kihwal] / [~daryn], I think we should close this as Won't fix or as 
a dup of HDFS-9289.

This bug keeps appearing in the blocker/critical list for releases, but we 
don't seem to be progressing.


> Blocks can be committed with wrong size
> ---
>
> Key: HDFS-8498
> URL: https://issues.apache.org/jira/browse/HDFS-8498
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> When an IBR for a UC block arrives, the NN updates the expected location's 
> block and replica state _only_ if it's on an unexpected storage for an 
> expected DN.  If it's for an expected storage, only the genstamp is updated.  
> When the block is committed, and the expected locations are verified, only 
> the genstamp is checked.  The size is not checked but it wasn't updated in 
> the expected locations anyway.
> A faulty client may misreport the size when committing the block.  The block 
> is effectively corrupted.  If the NN issues replications, the received IBR is 
> considered corrupt, so the NN invalidates the block and immediately issues another 
> replication.  The NN eventually realizes all the original replicas are 
> corrupt after full BRs are received from the original DNs.
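A purely hypothetical, self-contained sketch of the missing consistency check the description points at: comparing the client-committed length against the replica lengths actually reported, rather than checking only the genstamp. This is an illustration, not Hadoop code.

{code:java}
// Hypothetical illustration only: verify both genstamp and reported size at
// commit time, instead of checking the genstamp alone.
import java.util.List;

final class CommitCheckSketch {
  static final class ReportedReplica {
    final long genStamp;
    final long numBytes;
    ReportedReplica(long genStamp, long numBytes) {
      this.genStamp = genStamp;
      this.numBytes = numBytes;
    }
  }

  static boolean commitLooksConsistent(long clientCommittedLength,
                                       long expectedGenStamp,
                                       List<ReportedReplica> reports) {
    for (ReportedReplica r : reports) {
      if (r.genStamp != expectedGenStamp) {
        return false;                  // the existing genstamp check
      }
      if (r.numBytes != clientCommittedLength) {
        return false;                  // the size check the description says is absent
      }
    }
    return true;
  }
}
{code}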



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8674) Improve performance of postponed block scans

2016-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219121#comment-15219121
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8674:
---

[~daryn] / [~mingma], is it possible to get this in 2.7.3 in a week? Tx.

> Improve performance of postponed block scans
> 
>
> Key: HDFS-8674
> URL: https://issues.apache.org/jira/browse/HDFS-8674
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-8674.patch, HDFS-8674.patch
>
>
> When a standby goes active, it marks all nodes as "stale" which will cause 
> block invalidations for over-replicated blocks to be queued until full block 
> reports are received from the nodes with the block.  The replication monitor 
> scans the queue with O(N) runtime.  It picks a random offset and iterates 
> through the set to randomize blocks scanned.
> The result is devastating when a cluster loses multiple nodes during a 
> rolling upgrade. Re-replication occurs, the nodes come back, the excess block 
> invalidations are postponed. Rescanning just 2k blocks out of millions of 
> postponed blocks may take multiple seconds. During the scan, the write lock 
> is held which stalls all other processing.
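To see why the scan is O(N), here is a small self-contained sketch of the "random offset into a Set" pattern described above; it is illustrative only, not the actual replication monitor code.

{code:java}
// Illustrative sketch of the pattern described above: to scan ~2k blocks
// starting at a random offset in a Set, the loop still has to walk the
// iterator from the beginning, so each pass is O(N) in the set size.
import java.util.Iterator;
import java.util.Random;
import java.util.Set;

final class PostponedScanSketch {
  static <T> int scanBatch(Set<T> postponed, int batchSize, Random rnd) {
    int scanned = 0;
    int startOffset = postponed.isEmpty() ? 0 : rnd.nextInt(postponed.size());
    Iterator<T> it = postponed.iterator();
    // Skipping to the random offset is itself a linear walk over the set.
    for (int i = 0; i < startOffset && it.hasNext(); i++) {
      it.next();
    }
    while (it.hasNext() && scanned < batchSize) {
      it.next();           // process the block here
      scanned++;
    }
    return scanned;
  }
}
{code}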



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9263) tests are using /test/build/data; breaking Jenkins

2016-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219123#comment-15219123
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9263:
---

Either way, can you please see if this can get into 2.7.3 within a week? Tx.

> tests are using /test/build/data; breaking Jenkins
> --
>
> Key: HDFS-9263
> URL: https://issues.apache.org/jira/browse/HDFS-9263
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Blocker
> Attachments: HDFS-9263-001.patch, HDFS-9263-002.patch
>
>
> Some of the HDFS tests are using the path {{test/build/data}} to store files, 
> leaking files which fail the new post-build RAT test checks on Jenkins 
> (and dirtying all development systems with paths which {{mvn clean}} will 
> miss). This should be fixed.
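One common way a test can avoid hard-coding {{test/build/data}} is to resolve its directory from a build-provided property, sketched below; the property name {{test.build.data}} and the fallback location under target/ are assumptions for illustration.

{code:java}
// Hedged sketch of one way a test can avoid hard-coding test/build/data:
// resolve a per-test directory from a build-provided property (assumed to be
// test.build.data here), falling back under target/ so "mvn clean" removes it.
import java.io.File;

final class TestDirSketch {
  static File testDataDir(Class<?> testClass) {
    String base = System.getProperty("test.build.data",
        new File("target", "test-data").getPath());
    File dir = new File(base, testClass.getSimpleName());
    dir.mkdirs();
    return dir;
  }
}
{code}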



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8893) DNs with failed volumes stop serving during rolling upgrade

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224988#comment-15224988
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8893:
---

[~daryn] / [~shahrs87], any progress on this? Is this still a bug? 
Considering this for a 2.7.3 RC later this week.


> DNs with failed volumes stop serving during rolling upgrade
> ---
>
> Key: HDFS-8893
> URL: https://issues.apache.org/jira/browse/HDFS-8893
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Rushabh S Shah
>Assignee: Daryn Sharp
>Priority: Critical
>
> When a rolling upgrade starts, all DNs try to write a rolling_upgrade marker 
> to each of their volumes. If one of the volumes is bad, this will fail. When 
> this failure happens, the DN does not update the key it received from the NN.
> Unfortunately we had one failed volume on all 3 of the datanodes which were 
> holding the replicas.
> Keys expire after 20 hours so at about 20 hours into the rolling upgrade, the 
> DNs with failed volumes will stop serving clients.
> Here is the stack trace on the datanode side:
> {noformat}
> 2015-08-11 07:32:28,827 [DataNode: heartbeating to 8020] WARN 
> datanode.DataNode: IOException in offerService
> java.io.IOException: Read-only file system
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:947)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setRollingUpgradeMarkers(BlockPoolSliceStorage.java:721)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.setRollingUpgradeMarker(DataStorage.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.setRollingUpgradeMarker(FsDatasetImpl.java:2357)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.signalRollingUpgrade(BPOfferService.java:480)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.handleRollingUpgradeStatus(BPServiceActor.java:626)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:677)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:833)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}
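A hypothetical, self-contained sketch of the per-volume marker write that the stack trace comes from, showing how a single read-only volume fails the whole operation; the names are illustrative, not the actual DataNode code.

{code:java}
// Hypothetical illustration only: if the marker write is "all volumes or fail",
// one read-only volume aborts the whole operation, mirroring the IOException
// in the stack trace above.
import java.io.File;
import java.io.IOException;
import java.util.List;

final class RollingUpgradeMarkerSketch {
  static void writeMarkers(List<File> volumeDirs) throws IOException {
    for (File volume : volumeDirs) {
      File marker = new File(volume, "rolling_upgrade.marker");
      // createNewFile throws IOException on a read-only volume, aborting the loop.
      if (!marker.exists() && !marker.createNewFile()) {
        throw new IOException("Could not create marker on " + volume);
      }
    }
  }
}
{code}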



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8871) Decommissioning of a node with a failed volume may not start

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224994#comment-15224994
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8871:
---

[~daryn] / [~kihwal], any update on this? Is this still a bug? Considering this 
for a 2.7.3 RC later this week. Thanks.


> Decommissioning of a node with a failed volume may not start
> 
>
> Key: HDFS-8871
> URL: https://issues.apache.org/jira/browse/HDFS-8871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Kihwal Lee
>Assignee: Daryn Sharp
>Priority: Critical
>
> Since staleness may not be properly cleared, a node with a failed volume may 
> not actually get scanned for block replication. Nothing is being replicated 
> from these nodes.
> This bug does not manifest unless the datanode has a unique storage ID per 
> volume. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8893) DNs with failed volumes stop serving during rolling upgrade

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225002#comment-15225002
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8893:
---

Tx for the quick response, [~shahrs87]!

> DNs with failed volumes stop serving during rolling upgrade
> ---
>
> Key: HDFS-8893
> URL: https://issues.apache.org/jira/browse/HDFS-8893
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Rushabh S Shah
>Assignee: Daryn Sharp
>Priority: Critical
>
> When a rolling upgrade starts, all DNs try to write a rolling_upgrade marker 
> to each of their volumes. If one of the volumes is bad, this will fail. When 
> this failure happens, the DN does not update the key it received from the NN.
> Unfortunately we had one failed volume on all 3 of the datanodes which were 
> holding the replicas.
> Keys expire after 20 hours so at about 20 hours into the rolling upgrade, the 
> DNs with failed volumes will stop serving clients.
> Here is the stack trace on the datanode side:
> {noformat}
> 2015-08-11 07:32:28,827 [DataNode: heartbeating to 8020] WARN 
> datanode.DataNode: IOException in offerService
> java.io.IOException: Read-only file system
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:947)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setRollingUpgradeMarkers(BlockPoolSliceStorage.java:721)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.setRollingUpgradeMarker(DataStorage.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.setRollingUpgradeMarker(FsDatasetImpl.java:2357)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.signalRollingUpgrade(BPOfferService.java:480)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.handleRollingUpgradeStatus(BPServiceActor.java:626)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:677)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:833)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225026#comment-15225026
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

I commented on the JIRA way back (see 
https://issues.apache.org/jira/browse/HDFS-8791?focusedCommentId=1503&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1503),
 saying what I say below. Unfortunately, I haven't followed the patch after my 
initial comment. 

This isn't about any specific release - starting with 2.6, we declared support 
for rolling upgrades and downgrades. Any patch that breaks this should not be 
in branch-2.

Two options from where I stand:
 # For folks who worked on the patch: Is there a way to (a) make the 
upgrade/downgrade seamless for people who don't care about this, and (b) have 
explicit documentation for people who do care to switch this behavior on and 
are willing to risk not having downgrades? If this means a new configuration 
property, so be it. It's a necessary evil.
 # Just let specific users backport this into the specific 2.x branches they 
need and leave it only on trunk.

Unless this behavior stops breaking rolling upgrades/downgrades, I think we 
should just revert it from branch-2 and definitely 2.7.3 as it stands today.

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Fix For: 2.7.3
>
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v3-bin.patch, 
> hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> basically cause the disks to seek for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.
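To make the directory math above concrete, here is a small, self-contained 
sketch of the kind of shift-and-mask mapping a 256x256 layout implies. The 
class name and masks are illustrative only, not the exact Hadoop helper:
{code}
import java.io.File;

public class LayoutSketch {
  // One byte per level: 256 first-level dirs, 256*256 = 65,536 leaf dirs.
  static File idToBlockDir(File bpRoot, long blockId) {
    int d1 = (int) ((blockId >> 16) & 0xFF);
    int d2 = (int) ((blockId >> 8) & 0xFF);
    return new File(bpRoot, "subdir" + d1 + File.separator + "subdir" + d2);
  }

  public static void main(String[] args) {
    // e.g. block 1073741825 lands in .../subdir0/subdir0
    System.out.println(idToBlockDir(new File("/data/dfs/current"), 1073741825L));
    // With ~60K blocks spread over 64K leaves, most leaf directories hold at
    // most one block, so a cold scan pays roughly one far seek per directory.
  }
}
{code}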



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225482#comment-15225482
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

bq.  I didn't think that it broke rolling upgrade (you should still be able to 
upgrade from an earlier layout version to this one). Did I miss something?
My point was mainly about rolling downgrade. Just used upgrade/downgrade 
together in my comment because in my mind the expectations are the same.

bq. Do we actually support downgrade between 2.7 and 2.6? We changed the 
NameNode LayoutVersion, so I don't think so. These branches don't have 
HDFS-8432 either.
[~andrew.wang], tx for this info.

This is really unfortunate. Can you give a reference to the NameNode 
LayoutVersion change?

Did we ever establish clear rules about downgrades? We need to lay out our 
story around supporting downgrades continuously and codify it. I'd vote for 
keeping strict rules for downgrades too; otherwise users are left to fend for 
themselves in deciding the risk associated with every version upgrade - are we 
in a place where we can support this?

For upgrades, there is tribal knowledge amongst committers/reviewers at a 
minimum. And on the YARN side, we've proposed (but made little progress on) 
tools to automatically catch some of this - YARN-3292.

To conclude, is the consensus to document all these downgrade-related breakages 
but keep them in 2.7.x and 2.8?

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Fix For: 2.7.3
>
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v3-bin.patch, 
> hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> basically cause the disks to seek for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2016-04-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229398#comment-15229398
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

Tx [~kihwal]!

As [~kihwal] pointed out 
[here|https://issues.apache.org/jira/browse/HDFS-8791?focusedCommentId=15226253&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15226253],
 we had downgrades even in dot releases as a requirement in the original design 
doc, but we haven't been respecting that requirement.

Before we move on, I think we should converge on a plan of action for future 
changes that may affect downgrade scenarios - perhaps as a different JIRA - 
[~cnauroth] / [~kihwal] / [~szetszwo], is it possible for one of you to take 
charge of this? Thanks!

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Fix For: 2.8.0
>
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v3-bin.patch, 
> hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> basically cause the disks to seek for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8871) Decommissioning of a node with a failed volume may not start

2016-04-10 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8871:
--
Target Version/s:   (was: 2.7.3, 2.6.5)

Removing the target-version from this long-standing issue; please add it back 
once there is a patch available for release. Tx.

> Decommissioning of a node with a failed volume may not start
> 
>
> Key: HDFS-8871
> URL: https://issues.apache.org/jira/browse/HDFS-8871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Kihwal Lee
>Assignee: Daryn Sharp
>Priority: Critical
>
> Since staleness may not be properly cleared, a node with a failed volume may 
> not actually get scanned for block replication. Nothing is being replicated 
> from these nodes.
> This bug does not manifest unless the datanode has a unique storage ID per 
> volume. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8674) Improve performance of postponed block scans

2016-04-10 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8674:
--
Target Version/s: 2.7.4  (was: 2.7.3)

Haven't gotten any update; dropping this into 2.7.4.

> Improve performance of postponed block scans
> 
>
> Key: HDFS-8674
> URL: https://issues.apache.org/jira/browse/HDFS-8674
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-8674.patch, HDFS-8674.patch
>
>
> When a standby goes active, it marks all nodes as "stale" which will cause 
> block invalidations for over-replicated blocks to be queued until full block 
> reports are received from the nodes with the block.  The replication monitor 
> scans the queue with O(N) runtime.  It picks a random offset and iterates 
> through the set to randomize blocks scanned.
> The result is devastating when a cluster loses multiple nodes during a 
> rolling upgrade. Re-replication occurs, the nodes come back, the excess block 
> invalidations are postponed. Rescanning just 2k blocks out of millions of 
> postponed blocks may take multiple seconds. During the scan, the write lock 
> is held, which stalls all other processing.
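A minimal sketch of why the random-offset rescan is O(N) per pass: the 
iterator still has to walk past the offset entries before doing any useful 
work, all while the lock is held. Names here are illustrative, not the actual 
BlockManager fields:
{code}
import java.util.Iterator;
import java.util.Set;

class PostponedScanSketch {
  // Rescan `limit` blocks starting at a random offset: cost is
  // O(offset + limit), i.e. O(N) for a large postponed set.
  static void rescan(Set<Long> postponed, int offset, int limit) {
    Iterator<Long> it = postponed.iterator();
    for (int i = 0; i < offset && it.hasNext(); i++) {
      it.next(); // skipped entries are still visited one by one
    }
    for (int i = 0; i < limit && it.hasNext(); i++) {
      long blockId = it.next();
      // ... re-check replication state for blockId ...
    }
  }
}
{code}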



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-10271) Extra bytes are getting released from reservedSpace for append

2016-04-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235758#comment-15235758
 ] 

Vinod Kumar Vavilapalli commented on HDFS-10271:


Can somebody from HDFS please help review / commit this? It's one of the last 
few blocking tickets for 2.7.3!

> Extra bytes are getting released from reservedSpace for append
> --
>
> Key: HDFS-10271
> URL: https://issues.apache.org/jira/browse/HDFS-10271
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Critical
> Attachments: HDFS-10271-01.patch, HDFS-10271-branch-2.7-01.patch
>
>
> 1. The file already has some bytes available in a block (e.g. 1024B).
> 2. Re-open the file for append (here reserving (BlockSize-1024) bytes).
> 3. Write one byte and flush.
> 4. close()
> After close(), *BlockSize-1* bytes are released from reservedSpace instead of 
> *BlockSize-1025* bytes.
> Extra bytes reserved may create problems for other writers.
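A worked example of the accounting with a 128MB block size (numbers only; the 
variable names are illustrative):
{code}
long blockSize = 128L * 1024 * 1024;     // 134,217,728
long preExisting = 1024;                 // bytes already in the block
long reserved = blockSize - preExisting; // 134,216,704 reserved on append
// After writing 1 byte, bytesOnDisk = 1025:
long correctRelease = blockSize - 1025;  // 134,216,703
long buggyRelease   = blockSize - 1;     // 134,217,727
// buggyRelease - correctRelease == 1024: the pre-existing bytes are
// released a second time, skewing the volume's reserved-space count.
{code}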



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9426) Rollingupgrade finalization is not backward compatible

2015-11-18 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012255#comment-15012255
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9426:
---

[~kihwal] / [~vinayrpet], any update on this?

> Rollingupgrade finalization is not backward compatible
> --
>
> Key: HDFS-9426
> URL: https://issues.apache.org/jira/browse/HDFS-9426
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Blocker
> Attachments: HDFS-9426.branch-2.7.poc.patch, HDFS-9426.trunk.poc.patch
>
>
> After HDFS-7645, the namenode can return non-null {{rollingUpgradeInfo}} in 
> heartbeat responses. 2.7.1 or 2.6.x datanodes won't finalize the upgrade 
> because it's not null.
> NN might have to check the DN version and return different 
> {{rollingUpgradeInfo}}.
> HDFS-8656 recognized the compatibility issue of the changed semantics, but 
> unfortunately did not address the semantics of the heartbeat response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9273) ACLs on root directory may be lost after NN restart

2015-11-25 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9273:
--
Target Version/s: 2.7.2, 2.6.3

bq. Should this fix be included in the 2.6 / 2.7 maintenance releases? Losing 
ACL on restart seems pretty serious.
Tentatively adding for 2.6.3 and 2.7.2.

[~xiaochen], [~cnauroth], do you know if this is a long-standing bug or 
something broken recently?

> ACLs on root directory may be lost after NN restart
> ---
>
> Key: HDFS-9273
> URL: https://issues.apache.org/jira/browse/HDFS-9273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Xiao Chen
>Assignee: Xiao Chen
> Fix For: 2.8.0
>
> Attachments: HDFS-9273.001.patch, HDFS-9273.002.patch
>
>
> After restarting the namenode, the ACLs on the root directory ("/") may be 
> lost if they have been rolled over to the fsimage.
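A minimal way to exercise the reported scenario with the standard CLI (the ACL 
entry is an example; saveNamespace forces the roll to fsimage):
{noformat}
hdfs dfs -setfacl -m user:alice:rwx /
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
# restart the NameNode, then:
hdfs dfs -getfacl /    # the ACL on "/" may be missing
{noformat}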



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9273) ACLs on root directory may be lost after NN restart

2015-11-25 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9273:
--
Priority: Critical  (was: Major)

> ACLs on root directory may be lost after NN restart
> ---
>
> Key: HDFS-9273
> URL: https://issues.apache.org/jira/browse/HDFS-9273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: HDFS-9273.001.patch, HDFS-9273.002.patch
>
>
> After restarting namenode, the ACLs on the root directory ("/") may be lost 
> if it's rolled over to fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9445) Deadlock in datanode

2015-11-25 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027854#comment-15027854
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9445:
---

[~kihwal] / [~walter.k.su], I am holding off 2.7.2 for this one, please let me 
know if this needs more time. Thanks.

> Deadlock in datanode
> 
>
> Key: HDFS-9445
> URL: https://issues.apache.org/jira/browse/HDFS-9445
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Kihwal Lee
>Assignee: Walter Su
>Priority: Blocker
> Attachments: HDFS-9445.00.patch
>
>
> {noformat}
> Found one Java-level deadlock:
> =
> "DataXceiver for client DFSClient_attempt_xxx at /1.2.3.4:100 [Sending block 
> BP-x:blk_123_456]":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> "Thread-565":
>   waiting for ownable synchronizer 0xd55613c8, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "DataNode: heartbeating to my-nn:8020"
> "DataNode: heartbeating to my-nn:8020":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> {noformat}
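The cycle above is a lock-ordering inversion between the dataset monitor and a 
read-write lock. A self-contained sketch of the same shape (field and method 
names are illustrative, not the DataNode's actual code):
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DeadlockSketch {
  final Object dataset = new Object();            // like the FsDatasetImpl monitor
  final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

  // Like the heartbeat thread: takes the write lock, then wants the monitor.
  void heartbeat() {
    rw.writeLock().lock();
    try {
      synchronized (dataset) { /* ... */ }
    } finally {
      rw.writeLock().unlock();
    }
  }

  // Like "Thread-565": holds the monitor, then wants the read lock.
  void volumeWork() {
    synchronized (dataset) {
      rw.readLock().lock();   // blocks while heartbeat holds the write lock
      try { /* ... */ } finally { rw.readLock().unlock(); }
    }
  }
}
{code}
If the two methods interleave, heartbeat() waits on the monitor while 
volumeWork() waits on the write lock, and neither can proceed.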



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9426) Rollingupgrade finalization is not backward compatible

2015-11-25 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027859#comment-15027859
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9426:
---

[~vinayrpet], can you look at [~kihwal]'s patch and get it reviewed / 
committed? I am holding off 2.7.2 for this one, please let me know if this 
needs more time. Thanks.

> Rollingupgrade finalization is not backward compatible
> --
>
> Key: HDFS-9426
> URL: https://issues.apache.org/jira/browse/HDFS-9426
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Blocker
> Attachments: HDFS-9426.branch-2.7.poc.patch, HDFS-9426.trunk.poc.patch
>
>
> After HDFS-7645, the namenode can return non-null {{rollingUpgradeInfo}} in 
> heartbeat responses. 2.7.1 or 2.6.x datanodes won't finalize the upgrade 
> because it's not null.
> NN might have to check the DN version and return different 
> {{rollingUpgradeInfo}}.
> HDFS-8656 recognized the compatibility issue of the changed semantics, but 
> unfortunately did not address the semantics of the heartbeat response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-12-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1503#comment-1503
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

Agree with Kihwal that the release discussion is beside the point.

The question is what will happen to existing users' clusters, whether they are 
based on 2.7.x, 2.8.x, or 2.9.x.

Dumbing it down, I think there are (1) users who care about the perf issue and 
can manage the resulting storage layout change, and (2) those who won't. Making 
the upgrade (as well as the downgrade?) work seamlessly as part of our code 
pretty much seems like a blocker, in order to avoid surprises for unsuspecting 
users who are not in need of this change.

Forcing a manual step for a 2.6.x / 2.7.x user when he/she upgrades to 2.6.4, 
2.7.3 or 2.8.0 seems like a non-starter to me.

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, hadoop-56-layout-datanode-dir.tgz
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> basically cause the disks to seek for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-12-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8791:
--
Target Version/s: 2.8.0, 2.7.3, 2.6.3  (was: 2.6.3)

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, hadoop-56-layout-datanode-dir.tgz, 
> test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> basically cause the disks to seek for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-12-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042332#comment-15042332
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

Setting tentative target-versions 2.8.0, 2.7.3 for tracking.

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, hadoop-56-layout-datanode-dir.tgz, 
> test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> basically cause the disks to seek for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9445) Deadlock in datanode

2015-12-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047314#comment-15047314
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9445:
---

Bump! Any progress on this? Blocked on this for the 2.7.2 release.

> Deadlock in datanode
> 
>
> Key: HDFS-9445
> URL: https://issues.apache.org/jira/browse/HDFS-9445
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Kihwal Lee
>Assignee: Walter Su
>Priority: Blocker
> Attachments: HDFS-9445.00.patch, HDFS-9445.01.patch, 
> HDFS-9445.02.patch
>
>
> {noformat}
> Found one Java-level deadlock:
> =
> "DataXceiver for client DFSClient_attempt_xxx at /1.2.3.4:100 [Sending block 
> BP-x:blk_123_456]":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> "Thread-565":
>   waiting for ownable synchronizer 0xd55613c8, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "DataNode: heartbeating to my-nn:8020"
> "DataNode: heartbeating to my-nn:8020":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9273) ACLs on root directory may be lost after NN restart

2015-12-08 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9273:
--
Fix Version/s: 2.6.3
   2.7.3

Backported this to 2.6.3 and 2.7.2.

> ACLs on root directory may be lost after NN restart
> ---
>
> Key: HDFS-9273
> URL: https://issues.apache.org/jira/browse/HDFS-9273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Critical
> Fix For: 2.8.0, 2.7.3, 2.6.3
>
> Attachments: HDFS-9273.001.patch, HDFS-9273.002.patch
>
>
> After restarting the namenode, the ACLs on the root directory ("/") may be 
> lost if they have been rolled over to the fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9273) ACLs on root directory may be lost after NN restart

2015-12-08 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9273:
--
Fix Version/s: (was: 2.7.3)
   2.7.2

> ACLs on root directory may be lost after NN restart
> ---
>
> Key: HDFS-9273
> URL: https://issues.apache.org/jira/browse/HDFS-9273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Critical
> Fix For: 2.8.0, 2.7.2, 2.6.3
>
> Attachments: HDFS-9273.001.patch, HDFS-9273.002.patch
>
>
> After restarting the namenode, the ACLs on the root directory ("/") may be 
> lost if they have been rolled over to the fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9516) truncate file fails with data dirs on multiple disks

2015-12-16 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9516:
--
Target Version/s: 2.8.0, 2.7.3

[~shv], unfortunately this came in too late for 2.7.2. That said, I don’t see 
any reason why this shouldn’t be in 2.8.0 and 2.7.3. Setting the 
target-versions accordingly on JIRA.

If you agree, I'd appreciate backport help to those branches (branch-2.8.0, 
branch-2.7).


> truncate file fails with data dirs on multiple disks
> 
>
> Key: HDFS-9516
> URL: https://issues.apache.org/jira/browse/HDFS-9516
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: Bogdan Raducanu
>Assignee: Plamen Jeliazkov
> Fix For: 2.9.0
>
> Attachments: HDFS-9516_1.patch, HDFS-9516_2.patch, HDFS-9516_3.patch, 
> HDFS-9516_testFailures.patch, Main.java, truncate.dn.log
>
>
> FileSystem.truncate returns false (no exception), but the file is never closed 
> and is not writable after this.
> This seems to be because of copy-on-truncate, which is used because the system 
> is in an upgrade state. In this case a rename between devices is attempted.
> See the attached log and repro code.
> It probably also affects truncating a snapshotted file when copy-on-truncate 
> is also used.
> Possibly it affects not only truncate but any block recovery.
> I think the problem is in updateReplicaUnderRecovery:
> {code}
> ReplicaBeingWritten newReplicaInfo = new ReplicaBeingWritten(
> newBlockId, recoveryId, rur.getVolume(), 
> blockFile.getParentFile(),
> newlength);
> {code}
> blockFile is created with copyReplicaWithNewBlockIdAndGS, which is allowed to 
> choose any volume, so rur.getVolume() is not where the block is located.
>  
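A hedged sketch of the fix direction the last paragraph implies: derive the 
volume from where the new block file actually landed instead of reusing 
rur.getVolume(). This is pseudocode-level; getVolumeForFile is a hypothetical 
helper and the real signatures differ:
{code}
// Sketch only (not compilable as-is; signatures simplified):
File blockFile = copyReplicaWithNewBlockIdAndGS(rur, newBlockId, recoveryId);
FsVolumeSpi actualVolume = getVolumeForFile(blockFile); // hypothetical helper
ReplicaBeingWritten newReplicaInfo = new ReplicaBeingWritten(
    newBlockId, recoveryId, actualVolume,
    blockFile.getParentFile(), newlength);
{code}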



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8785) TestDistributedFileSystem is failing in trunk

2015-12-16 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061194#comment-15061194
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8785:
---

[~yzhangal], in addition to (or earlier instead of?) committing to branch-2.8, 
you ended up creating a new branch named _brachn-2.8_. Given the ASF-wide 
branch-related lock-down, that branch cannot be deleted now, at least for a 
while *sigh*.

> TestDistributedFileSystem is failing in trunk
> -
>
> Key: HDFS-8785
> URL: https://issues.apache.org/jira/browse/HDFS-8785
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0, 2.8.0
>Reporter: Arpit Agarwal
>Assignee: Xiaoyu Yao
> Fix For: 2.8.0
>
> Attachments: HDFS-8785.00.patch, HDFS-8785.01.patch, 
> HDFS-8785.02.patch
>
>
> A newly added test case 
> {{TestDistributedFileSystem#testDFSClientPeerWriteTimeout}} is failing in 
> trunk.
> e.g. run
> https://builds.apache.org/job/PreCommit-HDFS-Build/11716/testReport/org.apache.hadoop.hdfs/TestDistributedFileSystem/testDFSClientPeerWriteTimeout/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9463) The 'decommissioned' and 'decommissionedAndDead' icons do not show up alongside relevant Datanodes on the NN-UI

2015-12-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9463:
--
Fix Version/s: (was: 2.8.0)

> The 'decommissioned' and 'decommissionedAndDead' icons do not show up 
> alongside relevant Datanodes on the NN-UI
> ---
>
> Key: HDFS-9463
> URL: https://issues.apache.org/jira/browse/HDFS-9463
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: HDFS-9463-1.patch, Screen Shot 2015-11-24 at 3.23.00 
> PM.png
>
>
> The wrench sign for 'live and decommissioned' nodes and the power-down sign 
> for 'decommissioned and dead' nodes are absent for datanodes on the NN-UI 
> (2.7).
> The fix is to correct the spelling of 'decommissioned' in the relevant class 
> values in hadoop-css and dfshealth.html, and also to correct the spelling of 
> 'Decommissioning' in dfshealth.html.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9470) Encryption zone on root not loaded from fsimage after NN restart

2015-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070340#comment-15070340
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9470:
---

This originally never made it to branch-2.7.2 even though the fix version was 
set accordingly. Tx to [~djp] for catching this.

I just cherry-picked it for rolling a new RC for 2.7.2. FYI.

> Encryption zone on root not loaded from fsimage after NN restart
> 
>
> Key: HDFS-9470
> URL: https://issues.apache.org/jira/browse/HDFS-9470
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Critical
> Fix For: 2.7.2, 2.6.3
>
> Attachments: HDFS-9470.001.patch, HDFS-9470.002.patch, 
> HDFS-9470.003.patch
>
>
> When restarting namenode, the encryption zone for {{rootDir}} is not loaded 
> correctly from fsimage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9434) Recommission a datanode with 500k blocks may pause NN for 30 seconds

2015-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9434:
--
Fix Version/s: 2.7.2

> Recommission a datanode with 500k blocks may pause NN for 30 seconds
> 
>
> Key: HDFS-9434
> URL: https://issues.apache.org/jira/browse/HDFS-9434
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Fix For: 2.7.2, 2.6.3
>
> Attachments: h9434_20151116.patch, h9434_20151116_branch-2.6.patch
>
>
> In BlockManager, processOverReplicatedBlocksOnReCommission is called within 
> the namespace lock.  There is a (not very useful) log message printed in 
> processOverReplicatedBlock.  When there is a large number of blocks stored in 
> a storage, printing the log message for each block can prevent the NN from 
> processing any other operations.  We did see that it could pause the NN for 
> 30 seconds for a storage with 500k blocks.
> I suggest changing the log message to trace level as a quick fix.
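The quick fix amounts to demoting the per-block message and guarding it so no 
string building happens while the lock is held. A sketch (the logger name and 
message text are illustrative):
{code}
if (blockLog.isTraceEnabled()) {
  blockLog.trace("BLOCK* processOverReplicatedBlock: " + block
      + " is over-replicated on re-commission");
}
{code}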



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8615) Correct HTTP method in WebHDFS document

2016-01-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8615:
--
Fix Version/s: (was: 2.7.3)
   (was: 2.8.0)
   2.7.2

Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing 
fix-versions to reflect the same.

> Correct HTTP method in WebHDFS document
> ---
>
> Key: HDFS-8615
> URL: https://issues.apache.org/jira/browse/HDFS-8615
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.4.1
>Reporter: Akira AJISAKA
>Assignee: Brahma Reddy Battula
>  Labels: newbie
> Fix For: 2.7.2, 2.6.3
>
> Attachments: HDFS-8615.branch-2.6.patch, HDFS-8615.patch
>
>
> For example, {{-X PUT}} should be removed from the following curl command.
> {code:title=WebHDFS.md}
> ### Get ACL Status
> * Submit a HTTP GET request.
> curl -i -X PUT 
> "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETACLSTATUS"
> {code}
> Other than this example, there are several commands which {{-X PUT}} should 
> be removed from. We should fix them all.
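For reference, the corrected form simply drops the method flag, since 
GETACLSTATUS is a GET operation (host, port, and path are placeholders):
{code}
curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETACLSTATUS"
{code}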



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9431) DistributedFileSystem#concat fails if the target path is relative.

2016-01-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9431:
--
Fix Version/s: (was: 2.7.3)
   (was: 2.8.0)
   2.7.2

Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing 
fix-versions to reflect the same.

> DistributedFileSystem#concat fails if the target path is relative.
> --
>
> Key: HDFS-9431
> URL: https://issues.apache.org/jira/browse/HDFS-9431
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Kazuho Fujii
>Assignee: Kazuho Fujii
> Fix For: 2.7.2, 2.6.3
>
> Attachments: HDFS-9431.001.patch, HDFS-9431.002.patch
>
>
> {{DistributedFileSystem#concat}} fails if the target path is relative.
> The method tries to send a relative path to DFSClient at the first call.
> bq.  dfs.concat(getPathName(trg), srcsStr);
> But {{getPathName}} fails. It seems that {{trg}} should be {{absF}}, as in 
> the second call.
> bq.  dfs.concat(getPathName(absF), srcsStr);
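In other words, the fix sketched by the description is to qualify the target 
path before resolving it, mirroring the second call (shown schematically):
{code}
Path absF = fixRelativePart(trg);        // make the relative target absolute
dfs.concat(getPathName(absF), srcsStr);  // getPathName() now sees a valid path
{code}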



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9289) Make DataStreamer#block thread safe and verify genStamp in commitBlock

2016-01-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9289:
--
Fix Version/s: (was: 2.7.3)
   (was: 3.0.0)
   2.7.2

> Make DataStreamer#block thread safe and verify genStamp in commitBlock
> --
>
> Key: HDFS-9289
> URL: https://issues.apache.org/jira/browse/HDFS-9289
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Critical
> Fix For: 2.7.2, 2.6.3
>
> Attachments: HDFS-9289-branch-2.6.patch, HDFS-9289.1.patch, 
> HDFS-9289.2.patch, HDFS-9289.3.patch, HDFS-9289.4.patch, HDFS-9289.5.patch, 
> HDFS-9289.6.patch, HDFS-9289.7.patch, HDFS-9289.branch-2.7.patch, 
> HDFS-9289.branch-2.patch
>
>
> We have seen a case of a corrupt block caused by a file completing after a 
> pipelineUpdate, but with the old block genStamp. This caused the replicas on 
> two datanodes in the updated pipeline to be viewed as corrupt. Propose to 
> check the genStamp when committing the block.
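A hedged sketch of the proposed commit-time check (the exception type, field 
names, and message are illustrative):
{code}
// Reject the commit if the client reports a stale generation stamp.
if (commitBlock.getGenerationStamp() != storedBlock.getGenerationStamp()) {
  throw new IOException("Commit block with mismatching GS. NN has "
      + storedBlock + ", client submits " + commitBlock);
}
{code}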



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9289) Make DataStreamer#block thread safe and verify genStamp in commitBlock

2016-01-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097181#comment-15097181
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9289:
---

Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing 
fix-versions to reflect the same.

> Make DataStreamer#block thread safe and verify genStamp in commitBlock
> --
>
> Key: HDFS-9289
> URL: https://issues.apache.org/jira/browse/HDFS-9289
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Critical
> Fix For: 2.7.2, 2.6.3
>
> Attachments: HDFS-9289-branch-2.6.patch, HDFS-9289.1.patch, 
> HDFS-9289.2.patch, HDFS-9289.3.patch, HDFS-9289.4.patch, HDFS-9289.5.patch, 
> HDFS-9289.6.patch, HDFS-9289.7.patch, HDFS-9289.branch-2.7.patch, 
> HDFS-9289.branch-2.patch
>
>
> We have seen a case of a corrupt block caused by a file completing after a 
> pipelineUpdate, but with the old block genStamp. This caused the replicas on 
> two datanodes in the updated pipeline to be viewed as corrupt. Propose to 
> check the genStamp when committing the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2016-01-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097348#comment-15097348
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9493:
---

[~eddyxu], there is a branch-2.8 where you need to land this patch for it to 
make it into 2.8.0.

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Mingliang Liu
>Assignee: Tony Wu
> Fix For: 2.8.0
>
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch, 
> HDFS-9493.003.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9597) BaseReplicationPolicyTest should update data node stats after adding a data node

2016-01-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097351#comment-15097351
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9597:
---

[~benoyantony], there is a branch-2.8 where you need to land this patch for it 
to be in 2.8.0.

> BaseReplicationPolicyTest should update data node stats after adding a data 
> node
> 
>
> Key: HDFS-9597
> URL: https://issues.apache.org/jira/browse/HDFS-9597
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.8.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Blocker
> Fix For: 2.8.0
>
> Attachments: HDFS-9597.001.patch
>
>
> Looks like HDFS-9034 broke 
> TestReplicationPolicyConsiderLoad#testChooseTargetWithDecomNodes.
> This test has been failing since yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9574) Reduce client failures during datanode restart

2016-01-14 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9574:
--
Fix Version/s: (was: 3.0.0)

bq. This is an important improvement for rolling upgrades. Committed to 
branch-2.7 and branch-2.6.
+1 for this after the fact.

> Reduce client failures during datanode restart
> --
>
> Key: HDFS-9574
> URL: https://issues.apache.org/jira/browse/HDFS-9574
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
> Fix For: 2.7.2, 2.6.4
>
> Attachments: HDFS-9574.patch, HDFS-9574.v2.patch, 
> HDFS-9574.v3.br26.patch, HDFS-9574.v3.br27.patch, HDFS-9574.v3.patch
>
>
> Since DataXceiverServer is initialized before BP is fully up, client requests 
> will fail until the datanode registers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9666:
--
Fix Version/s: (was: 2.7.2)

[~aderen], please use Target Version to state your intention and leave the 
fix-version for committers to fill in at commit time. FYI, I fixed this JIRA 
myself. Tx.

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: ade
> Attachments: HDFS-9666.0.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we 
> enabled heterogeneous storage in our cluster. But only ~50% of the datanode & 
> regionserver hosts have SSD, so we can set hfiles with only the ONE_SSD (not 
> ALL_SSD) storage policy, and a regionserver on a non-SSD host can only read 
> the local disk replica. So we developed this feature in the hdfs client to 
> read even a remote SSD/RAM replica prior to the local disk replica.
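For context, storage policies of the kind mentioned are applied with the 
standard admin tool (the exact CLI varies slightly across 2.x releases; the 
path is an example):
{noformat}
hdfs storagepolicies -setStoragePolicy -path /hbase -policy ONE_SSD
hdfs storagepolicies -getStoragePolicy -path /hbase
{noformat}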



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-815) FileContext tests fail on Windows

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-815:
-
Fix Version/s: (was: 2.7.2)

> FileContext tests fail on Windows
> -
>
> Key: HDFS-815
> URL: https://issues.apache.org/jira/browse/HDFS-815
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
> Environment: Windows
>Reporter: Konstantin Shvachko
>
> The following FileContext-related tests are failing on Windows because of 
> incorrect use of the "test.build.data" system property for setting hdfs 
> paths, which end up containing "C:" as a path component, which hdfs does not 
> support.
> {code}
> org.apache.hadoop.fs.TestFcHdfsCreateMkdir
> org.apache.hadoop.fs.TestFcHdfsPermission
> org.apache.hadoop.fs.TestHDFSFileContextMainOperations
> org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8676) Delayed rolling upgrade finalization can cause heartbeat expiration and write failures

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8676:
--
Fix Version/s: (was: 3.0.0)

> Delayed rolling upgrade finalization can cause heartbeat expiration and write 
> failures
> --
>
> Key: HDFS-8676
> URL: https://issues.apache.org/jira/browse/HDFS-8676
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Walter Su
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8676.01.patch, HDFS-8676.02.patch
>
>
> In big busy clusters where the deletion rate is also high, a lot of blocks 
> can pile up in the datanode trash directories until an upgrade is finalized.  
> When it is finally finalized, the deletion of trash is done in the service 
> actor thread's context synchronously.  This blocks the heartbeat and can 
> cause heartbeat expiration.  
> We have seen a namenode losing hundreds of nodes after a delayed upgrade 
> finalization.  The deletion of trash directories should be made asynchronous.
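> A minimal sketch of the proposed direction (names are hypothetical, not the 
> actual patch): hand the deletion off to a dedicated thread so the service 
> actor can keep heartbeating:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> class TrashCleaner {
>   // Single daemon thread; trash deletion is not latency sensitive.
>   private final ExecutorService executor =
>       Executors.newSingleThreadExecutor(r -> {
>         Thread t = new Thread(r, "trash-cleaner");
>         t.setDaemon(true);
>         return t;
>       });
>
>   /** Called from the heartbeat path; returns immediately. */
>   void clearTrashAsync(Runnable deleteTrashDirectories) {
>     executor.execute(deleteTrashDirectories);
>   }
> }
> {code}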



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8950) NameNode refresh doesn't remove DataNodes that are no longer in the allowed list

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8950:
--
Fix Version/s: (was: 3.0.0)

> NameNode refresh doesn't remove DataNodes that are no longer in the allowed 
> list
> 
>
> Key: HDFS-8950
> URL: https://issues.apache.org/jira/browse/HDFS-8950
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>  Labels: 2.7.2-candidate
> Fix For: 2.7.2
>
> Attachments: HDFS-8950.001.patch, HDFS-8950.002.patch, 
> HDFS-8950.003.patch, HDFS-8950.004.patch, HDFS-8950.005.patch, 
> HDFS-8950.branch-2.7.patch
>
>
> If you remove a DN from the NN's allowed host list (HDFS was HA) and then 
> refresh the NN, the DN is not actually removed and the NN UI keeps showing 
> that node. The NN may also try to allocate blocks to that DN during an MR 
> job.  This issue is independent of DN decommission.
> To reproduce:
> 1. Add a DN to dfs_hosts_allow
> 2. Refresh NN
> 3. Start DN. Now NN starts seeing DN.
> 4. Stop DN
> 5. Remove DN from dfs_hosts_allow
> 6. Refresh NN -> NN is still reporting DN as being used by HDFS.
> This is different from decommissioning, where the DN is added to the exclude 
> list in addition to being removed from the allowed list; in that case 
> everything works correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9220) Reading small file (< 512 bytes) that is open for append fails due to incorrect checksum

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9220:
--
Fix Version/s: (was: 3.0.0)

> Reading small file (< 512 bytes) that is open for append fails due to 
> incorrect checksum
> 
>
> Key: HDFS-9220
> URL: https://issues.apache.org/jira/browse/HDFS-9220
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Bogdan Raducanu
>Assignee: Jing Zhao
>Priority: Blocker
> Fix For: 2.7.2
>
> Attachments: HDFS-9220.000.patch, HDFS-9220.001.patch, 
> HDFS-9220.002.patch, test2.java
>
>
> Exception:
> 2015-10-09 14:59:40 WARN  DFSClient:1150 - fetchBlockByteRange(). Got a 
> checksum exception for /tmp/file0.05355529331575182 at 
> BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882:0 from 
> DatanodeInfoWithStorage[10.10.10.10]:5001
> All 3 replicas cause this exception and the read fails entirely with:
> BlockMissingException: Could not obtain block: 
> BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882 
> file=/tmp/file0.05355529331575182
> Code to reproduce is attached.
> Does not happen in 2.7.0.
> Data is read correctly if checksum verification is disabled.
> More generally, the failure happens when reading from the last block of a 
> file and the last block has <= 512 bytes.
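> The attached test2.java is not reproduced here, but a minimal repro along 
> these lines (path and sizes are illustrative) matches the description:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class SmallAppendRead {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path p = new Path("/tmp/small-file");
>
>     // Write fewer than 512 bytes, i.e. less than one checksum chunk.
>     try (FSDataOutputStream out = fs.create(p, true)) {
>       out.write(new byte[100]);
>     }
>
>     // Keep the file open for append, then read it from another stream.
>     FSDataOutputStream appender = fs.append(p);
>     try (FSDataInputStream in = fs.open(p)) {
>       byte[] buf = new byte[100];
>       in.readFully(0, buf);  // per this report, fails with a checksum error
>     } finally {
>       appender.close();
>     }
>   }
> }
> {code}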



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8850:
--
Fix Version/s: (was: 3.0.0)

> VolumeScanner thread exits with exception if there is no block pool to be 
> scanned but there are suspicious blocks
> -
>
> Key: HDFS-8850
> URL: https://issues.apache.org/jira/browse/HDFS-8850
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
> Fix For: 2.7.2
>
> Attachments: HDFS-8850.001.patch
>
>
> The VolumeScanner threads inside the BlockScanner exit with an exception if 
> there is no block pool to be scanned but there are suspicious blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8879) Quota by storage type usage incorrectly initialized upon namenode restart

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8879:
--
Fix Version/s: (was: 3.0.0)

> Quota by storage type usage incorrectly initialized upon namenode restart
> -
>
> Key: HDFS-8879
> URL: https://issues.apache.org/jira/browse/HDFS-8879
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Kihwal Lee
>Assignee: Xiaoyu Yao
> Fix For: 2.7.2
>
> Attachments: HDFS-8879.01.patch
>
>
> This was found by [~kihwal] as part of HDFS-8865 work in this 
> [comment|https://issues.apache.org/jira/browse/HDFS-8865?focusedCommentId=14660904&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660904].
> The unit tests 
> testQuotaByStorageTypePersistenceInFsImage/testQuotaByStorageTypePersistenceInFsEdit
>  failed to detect this because they were using an obsolete
> FSDirectory instance. Once the highlighted line below is added, the issue can 
> be reproduced.
> {code}
> >fsdir = cluster.getNamesystem().getFSDirectory();
> INode testDirNodeAfterNNRestart = fsdir.getINode4Write(testDir.toString());
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7645:
--
Fix Version/s: (was: 3.0.0)

> Rolling upgrade is restoring blocks from trash multiple times
> -
>
> Key: HDFS-7645
> URL: https://issues.apache.org/jira/browse/HDFS-7645
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0
>Reporter: Nathan Roberts
>Assignee: Keisuke Ogiwara
> Fix For: 2.7.2
>
> Attachments: HDFS-7645.01.patch, HDFS-7645.02.patch, 
> HDFS-7645.03.patch, HDFS-7645.04.patch, HDFS-7645.05.patch, 
> HDFS-7645.06.patch, HDFS-7645.07.patch
>
>
> When performing an HDFS rolling upgrade, the trash directory gets restored 
> twice, when under normal circumstances it shouldn't need to be restored at 
> all. IIUC, the only time these blocks should be restored is if we need to 
> roll back a rolling upgrade. 
> On a busy cluster, this can cause significant and unnecessary block churn 
> both on the datanodes, and more importantly in the namenode.
> The two times this happens are:
> 1) restart of DN onto new software
> {code}
>   private void doTransition(DataNode datanode, StorageDirectory sd,
>   NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
> if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
>   Preconditions.checkState(!getTrashRootDir(sd).exists(),
>   sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not 
> " +
>   " both be present.");
>   doRollback(sd, nsInfo); // rollback if applicable
> } else {
>   // Restore all the files in the trash. The restored files are retained
>   // during rolling upgrade rollback. They are deleted during rolling
>   // upgrade downgrade.
>   int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
>   LOG.info("Restored " + restored + " block files from trash.");
> }
> {code}
> 2) When heartbeat response no longer indicates a rollingupgrade is in progress
> {code}
>   /**
>* Signal the current rolling upgrade status as indicated by the NN.
>* @param inProgress true if a rolling upgrade is in progress
>*/
>   void signalRollingUpgrade(boolean inProgress) throws IOException {
> String bpid = getBlockPoolId();
> if (inProgress) {
>   dn.getFSDataset().enableTrash(bpid);
>   dn.getFSDataset().setRollingUpgradeMarker(bpid);
> } else {
>   dn.getFSDataset().restoreTrash(bpid);
>   dn.getFSDataset().clearRollingUpgradeMarker(bpid);
> }
>   }
> {code}
> HDFS-6800 and HDFS-6981 were modifying this behavior making it not completely 
> clear whether this is somehow intentional. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9221) HdfsServerConstants#ReplicaState#getState should avoid calling values() since it creates a temporary array

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9221:
--
Fix Version/s: (was: 3.0.0)

> HdfsServerConstants#ReplicaState#getState should avoid calling values() since 
> it creates a temporary array
> --
>
> Key: HDFS-9221
> URL: https://issues.apache.org/jira/browse/HDFS-9221
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: performance
>Affects Versions: 2.7.1
>Reporter: Staffan Friberg
>Assignee: Staffan Friberg
> Fix For: 2.7.2
>
> Attachments: HADOOP-9221.001.patch
>
>
> When the BufferDecoder in BlockListAsLongs converts the stored value to a 
> ReplicaState enum it calls ReplicaState.getState(int); unfortunately this 
> method creates a ReplicaState[] on each call, since it calls 
> ReplicaState.values().
> This patch creates a cached version of the values and thus avoids all 
> allocation when doing the conversion.
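> The pattern is the standard one for hot enum lookups; a minimal sketch 
> (simplified from the real {{ReplicaState}}, so treat the members as 
> illustrative):
> {code:java}
> public enum ReplicaState {
>   FINALIZED(0), RBW(1), RWR(2), RUR(3), TEMPORARY(4);
>
>   // values() clones the backing array on every call; cache it once instead.
>   private static final ReplicaState[] CACHED_VALUES = values();
>
>   private final int value;
>
>   ReplicaState(int value) { this.value = value; }
>
>   public int getValue() { return value; }
>
>   public static ReplicaState getState(int v) {
>     return CACHED_VALUES[v];  // no per-call allocation
>   }
> }
> {code}
> The same trick applies to any enum-by-ordinal lookup on a hot path.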



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8656) Preserve compatibility of ClientProtocol#rollingUpgrade after finalization

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8656:
--
Fix Version/s: (was: 3.0.0)

> Preserve compatibility of ClientProtocol#rollingUpgrade after finalization
> --
>
> Key: HDFS-8656
> URL: https://issues.apache.org/jira/browse/HDFS-8656
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.8.0
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: hdfs-8656.001.patch, hdfs-8656.002.patch, 
> hdfs-8656.003.patch, hdfs-8656.004.patch
>
>
> HDFS-7645 changed rollingUpgradeInfo to still return an RUInfo after 
> finalization, so the DNs can differentiate between rollback and a 
> finalization. However, this breaks compatibility for the user facing APIs, 
> which always expect a null after finalization. Let's fix this and edify it in 
> unit tests.
> As an additional improvement, isFinalized and isStarted are part of the Java 
> API, but not in the JMX output of RollingUpgradeInfo. It'd be nice to expose 
> these booleans so JMX users don't need to do the != 0 check that possibly 
> exposes our implementation details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9178:
--
Fix Version/s: (was: 3.0.0)

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch
>
>
> When a non-leaf datanode in a pipeline is slow on or stuck at disk I/O, the 
> downstream node can time out on reading a packet, since even the heartbeat 
> packets will not be relayed down.  
> The packet read timeout is set in {{DataXceiver#run()}}:
> {code}
>   peer.setReadTimeout(dnConf.socketTimeout);
> {code}
> When the downstream node times out and closes the connection to the upstream, 
> the upstream node's {{PacketResponder}} gets {{EOFException}} and it sends an 
> ack upstream with the downstream node status set to {{ERROR}}.  This causes 
> the client to exclude the downstream node, even though the upstream node was 
> the one that got stuck.
> The connection to the downstream has a longer timeout, so the downstream will 
> always time out first. The downstream timeout is set in {{writeBlock()}}:
> {code}
>   int timeoutValue = dnConf.socketTimeout +
>   (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
>   int writeTimeout = dnConf.socketWriteTimeout +
>   (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
>   NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
>   OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
>   writeTimeout);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9290:
--
Fix Version/s: (was: 3.0.0)

> DFSClient#callAppend() is not backward compatible for slightly older NameNodes
> --
>
> Key: HDFS-9290
> URL: https://issues.apache.org/jira/browse/HDFS-9290
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Blocker
> Fix For: 2.7.2
>
> Attachments: HDFS-9290.001.patch, HDFS-9290.002.patch
>
>
> HDFS-7210 combined 2 RPC calls used at file append into a single one; 
> specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
> compatibility for older clients is handled by the new NameNode (protobuf), a 
> newer client's {{append()}} call does not work with older NameNodes. One will 
> run into an exception like the following:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.(DFSOutputStream.java:1560)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.(DFSOutputStream.java:1670)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
> at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
> at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
> {code}
> The cause is that the new client code expects both the last block and the 
> file info in the same RPC, but the old NameNode only replies with the first. 
> The exception itself does not reflect this, and one has to look at the 
> HDFS source code to really understand what happened.
> We can have the client detect that it's talking to an old NameNode and send 
> an extra {{getFileInfo()}} RPC. Or we should improve the exception being 
> thrown to accurately reflect the cause of failure.
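> A sketch of the first option, assuming the 2.7-era protocol types 
> ({{LastBlockWithStatus}}); the names and placement are illustrative, not the 
> actual patch:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.hdfs.protocol.ClientProtocol;
> import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;
> import org.apache.hadoop.hdfs.protocol.LastBlockWithStatus;
>
> class AppendCompat {
>   /** Sketch: tolerate older NameNodes that omit the file status. */
>   static HdfsFileStatus statusForAppend(ClientProtocol namenode, String src,
>       LastBlockWithStatus reply) throws IOException {
>     HdfsFileStatus stat = reply.getFileStatus();
>     if (stat == null) {
>       // Older NameNode: it only returned the last block, so issue the
>       // getFileInfo() RPC that HDFS-7210 folded into append().
>       stat = namenode.getFileInfo(src);
>     }
>     return stat;
>   }
> }
> {code}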



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9445) Datanode may deadlock while handling a bad volume

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9445:
--
Fix Version/s: (was: 3.0.0)

> Datanode may deadlock while handling a bad volume
> -
>
> Key: HDFS-9445
> URL: https://issues.apache.org/jira/browse/HDFS-9445
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Kihwal Lee
>Assignee: Walter Su
>Priority: Blocker
> Fix For: 2.7.2, 2.6.4
>
> Attachments: HDFS-9445-branch-2.6.02.patch, 
> HDFS-9445-branch-2.6_02.patch, HDFS-9445.00.patch, HDFS-9445.01.patch, 
> HDFS-9445.02.patch
>
>
> {noformat}
> Found one Java-level deadlock:
> =
> "DataXceiver for client DFSClient_attempt_xxx at /1.2.3.4:100 [Sending block 
> BP-x:blk_123_456]":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> "Thread-565":
>   waiting for ownable synchronizer 0xd55613c8, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "DataNode: heartbeating to my-nn:8020"
> "DataNode: heartbeating to my-nn:8020":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8384) Allow NN to startup if there are files having a lease but are not under construction

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8384:
--
Fix Version/s: (was: 2.8.0)

> Allow NN to startup if there are files having a lease but are not under 
> construction
> 
>
> Key: HDFS-8384
> URL: https://issues.apache.org/jira/browse/HDFS-8384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Jing Zhao
>Priority: Minor
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8384-branch-2.6.patch, HDFS-8384-branch-2.7.patch, 
> HDFS-8384.000.patch
>
>
> When there are files that have a lease but are not under construction, the NN 
> will fail to start up with
> {code}
> 15/05/12 00:36:31 ERROR namenode.FSImage: Unable to save image for 
> /hadoop/hdfs/namenode
> java.lang.IllegalStateException
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodesUnderConstruction(LeaseManager.java:412)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFilesUnderConstruction(FSNamesystem.java:7124)
> ...
> {code}
> The actual problem is that the image could be corrupted by bugs like 
> HDFS-7587.  We should have an option/conf to allow the NN to start up so that 
> the problematic files could possibly be deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7609) Avoid retry cache collision when Standby NameNode loading edits

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7609:
--
Fix Version/s: (was: 2.8.0)

> Avoid retry cache collision when Standby NameNode loading edits
> ---
>
> Key: HDFS-7609
> URL: https://issues.apache.org/jira/browse/HDFS-7609
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.2.0
>Reporter: Carrey Zhan
>Assignee: Ming Ma
>Priority: Critical
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-7609-2.patch, HDFS-7609-3.patch, 
> HDFS-7609-CreateEditsLogWithRPCIDs.patch, HDFS-7609-branch-2.7.2.txt, 
> HDFS-7609.patch, recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same 
> time under very high load, leaving behind about 100 million transactions in 
> the edits log. (I still have no idea why they were not rolled into fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be 
> needed to finish, and it was loading fsedits most of the time. I also tried 
> to restart the namenode in recover mode; the loading speed was no different.
> I looked into the stack trace and judged that it was caused by the retry 
> cache. So I set dfs.namenode.enable.retrycache to false, and the restart 
> process finished in half an hour.
> I think the retry cache is useless during startup, at least during the 
> recover process.
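> For anyone hitting the same recovery scenario, the workaround amounts to a 
> single configuration override; shown below via the {{Configuration}} API as a 
> sketch (setting the property in hdfs-site.xml before the restart is 
> equivalent, and the default should be restored afterwards):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.HdfsConfiguration;
>
> class RecoveryConf {
>   /** Disable the retry cache for a one-off recovery restart. */
>   static Configuration withoutRetryCache() {
>     Configuration conf = new HdfsConfiguration();
>     conf.setBoolean("dfs.namenode.enable.retrycache", false);
>     return conf;
>   }
> }
> {code}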



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8846) Add a unit test for INotify functionality across a layout version upgrade

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8846:
--
Fix Version/s: (was: 2.8.0)

> Add a unit test for INotify functionality across a layout version upgrade
> -
>
> Key: HDFS-8846
> URL: https://issues.apache.org/jira/browse/HDFS-8846
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8846-branch-2.6.1.txt, HDFS-8846.00.patch, 
> HDFS-8846.01.patch, HDFS-8846.02.patch, HDFS-8846.03.patch
>
>
> Per discussion under HDFS-8480, we should create some edit log files with an 
> old layout version, to test whether they can be correctly handled in upgrades.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9294) DFSClient deadlock when close file and failed to renew lease

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9294:
--
Fix Version/s: (was: 2.8.0)

> DFSClient  deadlock when close file and failed to renew lease
> -
>
> Key: HDFS-9294
> URL: https://issues.apache.org/jira/browse/HDFS-9294
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.2.0, 2.7.1
> Environment: Hadoop 2.2.0
>Reporter: DENG FEI
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Fix For: 2.7.2, 2.6.4
>
> Attachments: HDFS-9294-002.patch, HDFS-9294-002.patch, 
> HDFS-9294-branch-2.6.patch, HDFS-9294-branch-2.7.patch, 
> HDFS-9294-branch-2.patch, HDFS-9294.patch
>
>
> We found a deadlock in our HBase (0.98) cluster (the Hadoop version is 
> 2.2.0), and it should be an HDFS bug; at the time our network was not stable.
> Below is the stack:
> *
> Found one Java-level deadlock:
> =
> "MemStoreFlusher.1":
>   waiting to lock monitor 0x7ff27cfa5218 (object 0x0002fae5ebe0, a 
> org.apache.hadoop.hdfs.LeaseRenewer),
>   which is held by "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel"
> "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel":
>   waiting to lock monitor 0x7ff2e67e16a8 (object 0x000486ce6620, a 
> org.apache.hadoop.hdfs.DFSOutputStream),
>   which is held by "MemStoreFlusher.0"
> "MemStoreFlusher.0":
>   waiting to lock monitor 0x7ff27cfa5218 (object 0x0002fae5ebe0, a 
> org.apache.hadoop.hdfs.LeaseRenewer),
>   which is held by "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel"
> Java stack information for the threads listed above:
> ===
> "MemStoreFlusher.1":
>   at org.apache.hadoop.hdfs.LeaseRenewer.addClient(LeaseRenewer.java:216)
>   - waiting to lock <0x0002fae5ebe0> (a 
> org.apache.hadoop.hdfs.LeaseRenewer)
>   at org.apache.hadoop.hdfs.LeaseRenewer.getInstance(LeaseRenewer.java:81)
>   at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:648)
>   at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:659)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1882)
>   - locked <0x00055b606cb0> (a org.apache.hadoop.hdfs.DFSOutputStream)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:71)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:104)
>   at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:250)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:402)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:974)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:78)
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>   - locked <0x00059869eed8> (a java.lang.Object)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:812)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:1974)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1795)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1678)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1591)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:472)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:211)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$500(MemStoreFlusher.java:66)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:238)
>   at java.lang.Thread.run(Thread.java:744)
> "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel":
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:1822)
>   - waiting to lock <0x000486ce6620> (a 
> org.apache.hadoop.hdfs.DFSOutputStream)
>   at 
> org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:780)
>   at org.apache.hadoop.hdfs.DFSClient.abort(DFSClient.java:753)
>   at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:453)
>   - locked <0x0002fae5ebe0> (a org.apache.hadoop.hdfs.LeaseRenewer)
>   at org.apache.hadoop.hdfs.LeaseRenewer.

[jira] [Updated] (HDFS-9273) ACLs on root directory may be lost after NN restart

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9273:
--
Fix Version/s: (was: 2.8.0)

> ACLs on root directory may be lost after NN restart
> ---
>
> Key: HDFS-9273
> URL: https://issues.apache.org/jira/browse/HDFS-9273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Critical
> Fix For: 2.7.2, 2.6.3
>
> Attachments: HDFS-9273.001.patch, HDFS-9273.002.patch
>
>
> After restarting the namenode, the ACLs on the root directory ("/") may be 
> lost if the change has been rolled over into the fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8219) setStoragePolicy with folder behavior is different after cluster restart

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8219:
--
Fix Version/s: (was: 2.8.0)

> setStoragePolicy with folder behavior is different after cluster restart
> 
>
> Key: HDFS-8219
> URL: https://issues.apache.org/jira/browse/HDFS-8219
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Peter Shi
>Assignee: Surendra Singh Lilhore
>  Labels: 2.6.1-candidate, 2.7.2-candidate, BB2015-05-RFC
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8219.patch, HDFS-8219.unittest-norepro.patch
>
>
> Steps to reproduce:
> 1) mkdir a folder named /temp
> 2) put one file A under /temp
> 3) change the /temp storage policy to COLD
> 4) use -getStoragePolicy to query file A's storage policy; it is the same as 
> /temp's
> 5) change the /temp folder's storage policy again; file A's storage policy 
> stays the same as the parent folder's.
> Then restart the cluster.
> Do 3) and 4) again: file A's storage policy does not change while the parent 
> folder's storage policy does. The behavior is different.
> As I debugged, I found this code
> in INodeFile.getStoragePolicyID:
> {code}
>   public byte getStoragePolicyID() {
> byte id = getLocalStoragePolicyID();
> if (id == BLOCK_STORAGE_POLICY_ID_UNSPECIFIED) {
>   return this.getParent() != null ?
>   this.getParent().getStoragePolicyID() : id;
> }
> return id;
>   }
> {code}
> If the file does not have its own storage policy, it will use the parent's. 
> But after a cluster restart, the file turns out to have its own storage 
> policy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8431) hdfs crypto class not found in Windows

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8431:
--
Fix Version/s: (was: 2.8.0)

> hdfs crypto class not found in Windows
> --
>
> Key: HDFS-8431
> URL: https://issues.apache.org/jira/browse/HDFS-8431
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: scripts
>Affects Versions: 2.6.0
> Environment: Windows only
>Reporter: Sumana Sathish
>Assignee: Anu Engineer
>Priority: Critical
>  Labels: 2.6.1-candidate, 2.7.2-candidate, encryption, scripts, 
> windows
> Fix For: 2.6.1, 2.7.2
>
> Attachments: Screen Shot 2015-05-18 at 6.27.11 PM.png, 
> hdfs-8431.001.patch, hdfs-8431.002.patch
>
>
> The attached screenshot shows that hdfs could not find the class 'crypto' on 
> Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7314) When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7314:
--
Fix Version/s: (was: 2.8.0)

> When the DFSClient lease cannot be renewed, abort open-for-write files rather 
> than the entire DFSClient
> ---
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
>  Labels: 2.6.1-candidate, 2.7.2-candidate, BB2015-05-TBR
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314-8.patch, 
> HDFS-7314-9.patch, HDFS-7314-branch-2.7.2.txt, HDFS-7314.patch
>
>
> It happened in a YARN nodemanager scenario, but it could happen to any 
> long-running service that uses a cached instance of DistributedFileSystem.
> 1. The active NN is under heavy load, so it became unavailable for 10 
> minutes; any DFSClient request will get ConnectTimeoutException.
> 2. The YARN nodemanager uses DFSClient for certain write operations such as 
> the log aggregator or the shared cache in YARN-1492. The renewLease RPC of 
> the DFSClient used by the YARN NM got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some 
> well-defined exception. Then the next HDFS call will create a new instance 
> of DistributedFileSystem. We have to fix all the places in YARN. Plus other 
> HDFS applications need to address this as well.
> * DistributedFileSystem detects an Aborted DFSClient and creates a new 
> instance of DFSClient. We will need to fix all the places where 
> DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all 
> requests; instead it can retry. If the NN is available again it can 
> transition back to a healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8046) Allow better control of getContentSummary

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8046:
--
Fix Version/s: (was: 2.8.0)

> Allow better control of getContentSummary
> -
>
> Key: HDFS-8046
> URL: https://issues.apache.org/jira/browse/HDFS-8046
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8046-branch-2.6.1.txt, HDFS-8046.v1.patch
>
>
> On busy clusters, users performing quota checks against a big directory 
> structure can affect the namenode performance. It has become a lot better 
> after HDFS-4995, but as clusters get bigger and busier, it is apparent that 
> we need finer-grained control to avoid a long read lock causing a throughput 
> drop.
> Even with the unfair namesystem lock setting, a long read lock (tens of 
> milliseconds) can starve many readers and especially writers. So the locking 
> duration should be reduced, which can be done by imposing a lower 
> count-per-iteration limit in the existing implementation.  But HDFS-4995 came 
> with a fixed amount of sleep between locks. This needs to be made 
> configurable, so that {{getContentSummary()}} doesn't get exceedingly slow.
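> As a rough sketch of the locking pattern in question (hypothetical names; 
> the real code lives in the content summary computation), with both the 
> count-per-iteration limit and the sleep made configurable:
> {code:java}
> import java.util.List;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
> import java.util.function.Consumer;
>
> class YieldingTreeWalk<T> {
>   private final ReentrantReadWriteLock lock;
>   private final int countPerIteration;   // items processed per lock hold
>   private final long sleepMs;            // configurable sleep between holds
>
>   YieldingTreeWalk(ReentrantReadWriteLock lock, int count, long sleepMs) {
>     this.lock = lock;
>     this.countPerIteration = count;
>     this.sleepMs = sleepMs;
>   }
>
>   void walk(List<T> items, Consumer<T> visit) throws InterruptedException {
>     int n = 0;
>     lock.readLock().lock();
>     try {
>       for (T item : items) {
>         visit.accept(item);
>         if (++n % countPerIteration == 0) {
>           lock.readLock().unlock();     // let writers and other readers in
>           try {
>             Thread.sleep(sleepMs);
>           } finally {
>             lock.readLock().lock();
>           }
>         }
>       }
>     } finally {
>       lock.readLock().unlock();
>     }
>   }
> }
> {code}
> A smaller count-per-iteration shortens each lock hold, while a smaller sleep 
> keeps {{getContentSummary()}} from getting exceedingly slow.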



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6945) BlockManager should remove a block from excessReplicateMap and decrement ExcessBlocks metric when the block is removed

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-6945:
--
Fix Version/s: (was: 2.8.0)

> BlockManager should remove a block from excessReplicateMap and decrement 
> ExcessBlocks metric when the block is removed
> --
>
> Key: HDFS-6945
> URL: https://issues.apache.org/jira/browse/HDFS-6945
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.0
>Reporter: Akira AJISAKA
>Assignee: Akira AJISAKA
>Priority: Critical
>  Labels: metrics
> Fix For: 2.7.2, 2.6.4
>
> Attachments: HDFS-6945-003.patch, HDFS-6945-004.patch, 
> HDFS-6945-005.patch, HDFS-6945.2.patch, HDFS-6945.patch
>
>
> I'm seeing the ExcessBlocks metric increase to more than 300K in some 
> clusters; however, there are no over-replicated blocks (confirmed by fsck).
> After further research, I noticed that when deleting a block, BlockManager 
> does not remove the block from excessReplicateMap or decrement 
> excessBlocksCount.
> Usually the metric is decremented when processing a block report; however, if 
> the block has already been deleted, BlockManager does not remove the block 
> from excessReplicateMap or decrement the metric.
> That way the metric and excessReplicateMap can grow indefinitely (i.e. a 
> memory leak can occur).
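> The missing cleanup, reduced to a sketch with hypothetical types (the real 
> map is keyed by datanode UUID and lives in BlockManager):
> {code:java}
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Iterator;
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.atomic.AtomicLong;
>
> class ExcessReplicaTracker {
>   // datanode UUID -> ids of blocks with excess replicas on that node
>   private final Map<String, Set<Long>> excessReplicateMap = new HashMap<>();
>   private final AtomicLong excessBlocksCount = new AtomicLong(); // the metric
>
>   void markExcess(String datanodeUuid, long blockId) {
>     Set<Long> ids =
>         excessReplicateMap.computeIfAbsent(datanodeUuid, k -> new HashSet<>());
>     if (ids.add(blockId)) {
>       excessBlocksCount.incrementAndGet();
>     }
>   }
>
>   /** The fix: call this on block deletion, not only from block reports. */
>   void onBlockRemoved(long blockId) {
>     Iterator<Set<Long>> it = excessReplicateMap.values().iterator();
>     while (it.hasNext()) {
>       Set<Long> ids = it.next();
>       if (ids.remove(blockId)) {
>         excessBlocksCount.decrementAndGet();
>       }
>       if (ids.isEmpty()) {
>         it.remove();  // prevent the unbounded growth described above
>       }
>     }
>   }
> }
> {code}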



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8817) Make StorageType for Volumes in DataNode visible through JMX

2016-02-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8817:
--
Fix Version/s: (was: 2.8.0)

> Make StorageType for Volumes in DataNode visible through JMX
> 
>
> Key: HDFS-8817
> URL: https://issues.apache.org/jira/browse/HDFS-8817
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Attachments: HDFS-8817.001.patch
>
>
> StorageTypes are part of Volumes on DataNodes. Right now {{VolumeInfo}} does 
> not contain the StorageType info.  This JIRA proposes to expose that info 
> through the VolumeInfo JSON.
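> A sketch of the shape of the change (accessor and field names are 
> hypothetical; the real JSON is assembled in the DataNode's MXBean code):
> {code:java}
> import java.util.LinkedHashMap;
> import java.util.Map;
> import org.apache.hadoop.fs.StorageType;
>
> class VolumeInfoJson {
>   static Map<String, Object> describe(long used, long free, long reserved,
>       StorageType type) {
>     Map<String, Object> info = new LinkedHashMap<>();
>     info.put("usedSpace", used);
>     info.put("freeSpace", free);
>     info.put("reservedSpace", reserved);
>     info.put("storageType", type.toString()); // the field proposed here
>     return info;
>   }
> }
> {code}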



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8344) NameNode doesn't recover lease for files with missing blocks

2016-02-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8344:
--
Fix Version/s: (was: 2.8.0)

Removing the fix-version given that the patch was reverted and is awaiting the 
final commit.

> NameNode doesn't recover lease for files with missing blocks
> 
>
> Key: HDFS-8344
> URL: https://issues.apache.org/jira/browse/HDFS-8344
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: HDFS-8344.01.patch, HDFS-8344.02.patch, 
> HDFS-8344.03.patch, HDFS-8344.04.patch, HDFS-8344.05.patch, 
> HDFS-8344.06.patch, HDFS-8344.07.patch, HDFS-8344.08.patch, 
> HDFS-8344.09.patch, HDFS-8344.10.patch, TestHadoop.java
>
>
> I found another\(?) instance in which the lease is not recovered. This is 
> easily reproducible on a pseudo-distributed single-node cluster.
> # Before you start, it helps if you set the following. This is not 
> necessary, but it simply reduces how long you have to wait:
> {code}
>   public static final long LEASE_SOFTLIMIT_PERIOD = 30 * 1000;
>   public static final long LEASE_HARDLIMIT_PERIOD = 2 * 
> LEASE_SOFTLIMIT_PERIOD;
> {code}
> # Client starts to write a file. (It could be less than 1 block, but it has 
> hflushed, so some of the data has landed on the datanodes.) (I'm copying the 
> client code I am using. I generate a jar and run it using $ hadoop jar 
> TestHadoop.jar; a sketch is included after this list.)
> # Client crashes. (I simulate this by kill -9 of the $(hadoop jar 
> TestHadoop.jar) process after it has printed "Wrote to the bufferedWriter".)
> # Shoot the datanode. (Since I ran on a pseudo-distributed cluster, there was 
> only 1)
> I believe the lease should be recovered and the block should be marked 
> missing. However, this is not happening: the lease is never recovered.
> The effect of this bug for us was that nodes could not be decommissioned 
> cleanly. Although we knew that the client had crashed, the Namenode never 
> released the leases (even after restarting the Namenode) (even months 
> afterwards). There are actually several other cases too where we don't 
> consider what happens if ALL the datanodes die while the file is being 
> written, but I am going to punt on that for another time.
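> The attached TestHadoop.java is not reproduced here, but a minimal client 
> along these lines (path and payload are illustrative) exhibits the behavior:
> {code:java}
> import java.io.BufferedWriter;
> import java.io.OutputStreamWriter;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class TestHadoop {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     FSDataOutputStream out = fs.create(new Path("/tmp/lease-test"), true);
>     BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out));
>
>     writer.write("some data, less than one block");
>     writer.flush();
>     out.hflush();  // data reaches the datanode pipeline; file stays open
>     System.out.println("Wrote to the bufferedWriter");
>
>     Thread.sleep(Long.MAX_VALUE);  // hold the lease until kill -9
>   }
> }
> {code}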



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7425) NameNode block deletion logging uses incorrect appender.

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7425:
--
Labels: 2.6.1-candidate  (was: )

> NameNode block deletion logging uses incorrect appender.
> 
>
> Key: HDFS-7425
> URL: https://issues.apache.org/jira/browse/HDFS-7425
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1
>
> Attachments: HDFS-7425-branch-2.1.patch
>
>
> The NameNode uses 2 separate Log4J appenders for tracking state changes.  The 
> appenders are named "org.apache.hadoop.hdfs.StateChange" and 
> "BlockStateChange".  The intention of BlockStateChange is to separate more 
> verbose block state change logging and allow it to be configured separately.  
> In branch-2, there is some block state change logging that incorrectly goes 
> to the "org.apache.hadoop.hdfs.StateChange" appender though.  The bug is not 
> present in trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7443) Datanode upgrade to BLOCKID_BASED_LAYOUT fails if duplicate block files are present in the same volume

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7443:
--
Labels: 2.6.1-candidate  (was: )

> Datanode upgrade to BLOCKID_BASED_LAYOUT fails if duplicate block files are 
> present in the same volume
> --
>
> Key: HDFS-7443
> URL: https://issues.apache.org/jira/browse/HDFS-7443
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Kihwal Lee
>Assignee: Colin Patrick McCabe
>Priority: Blocker
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1
>
> Attachments: HDFS-7443.001.patch, HDFS-7443.002.patch
>
>
> When we did an upgrade from 2.5 to 2.6 in a medium-size cluster, about 4% of 
> the datanodes were not coming up.  They tried the data file layout upgrade 
> for BLOCKID_BASED_LAYOUT introduced in HDFS-6482, but failed.
> All failures were caused by {{NativeIO.link()}} throwing IOException saying 
> {{EEXIST}}.  The data nodes didn't die right away, but the upgrade was soon 
> retried when the block pool initialization was retried whenever 
> {{BPServiceActor}} was registering with the namenode.  After many retries, 
> the datanodes terminated.  This would leave {{previous.tmp}} and {{current}} 
> with no {{VERSION}} file in the block pool slice storage directory.  
> Although {{previous.tmp}} contained the old {{VERSION}} file, the content was 
> in the new layout and the subdirs were all newly created ones.  This 
> shouldn't have happened because the upgrade-recovery logic in {{Storage}} 
> removes {{current}} and renames {{previous.tmp}} to {{current}} before 
> retrying.  All successfully upgraded volumes had old state preserved in their 
> {{previous}} directory.
> In summary there were two observed issues.
> - Upgrade failure with {{link()}} failing with {{EEXIST}}
> - {{previous.tmp}} contained not the content of original {{current}}, but 
> half-upgraded one.
> We did not see this in smaller scale test clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7489) Incorrect locking in FsVolumeList#checkDirs can hang datanodes

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7489:
--
Labels: 2.6.1-candidate  (was: )

> Incorrect locking in FsVolumeList#checkDirs can hang datanodes
> --
>
> Key: HDFS-7489
> URL: https://issues.apache.org/jira/browse/HDFS-7489
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.5.0, 2.6.0
>Reporter: Noah Lorang
>Assignee: Noah Lorang
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1
>
> Attachments: HDFS-7489-v1.patch, HDFS-7489-v2.patch, 
> HDFS-7489-v2.patch.1
>
>
> After upgrading to 2.5.0 (CDH 5.2.1), we started to see datanodes hanging 
> their heartbeats and requests from clients. After some digging, I identified 
> the culprit as the checkDiskError() triggered by catching IOExceptions (in 
> our case, SocketExceptions being triggered on one datanode by 
> ReplicaAlreadyExistsExceptions on another datanode).
> Thread dumps reveal that the checkDiskErrors() thread is holding a lock on 
> the FsVolumeList:
> {code}
> "Thread-409" daemon prio=10 tid=0x7f4e50200800 nid=0x5b8e runnable 
> [0x7f4e2f855000]
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.list(Native Method)
> at java.io.File.list(File.java:973)
> at java.io.File.listFiles(File.java:1051)
> at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:89)
> at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
> at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:257)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:210)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:180)
> - locked <0x00063b182ea0> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:1396)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2832)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> Other things would then lock the FsDatasetImpl while waiting for the 
> FsVolumeList, e.g.:
> {code}
> "DataXceiver for client  at /10.10.0.52:46643 [Receiving block 
> BP-1573746465-127.0.1.1-1352244533715:blk_1073770670_106962574]" daemon 
> prio=10 tid=0x7f4e55561000 nid=0x406d waiting for monitor entry 
> [0x7f4e3106d000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.getNextVolume(FsVolumeList.java:64)
> - waiting to lock <0x00063b182ea0> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:927)
> - locked <0x00063b1f9a48> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:101)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:167)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:604)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> That lock on the FsDatasetImpl then causes other threads to block:
> {code}
> "Thread-127" daemon prio=10 tid=0x7f4e4c67d800 nid=0x2e02 waiting for 
> monitor entry [0x7f4e3339]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:228)
> - waiting to lock <0x00063b1f9a48> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:436)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:523)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:684)
>   

[jira] [Updated] (HDFS-7503) Namenode restart after large deletions can cause slow processReport (due to logging)

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7503:
--
Labels: 2.6.1-candidate  (was: )

> Namenode restart after large deletions can cause slow processReport (due to 
> logging)
> 
>
> Key: HDFS-7503
> URL: https://issues.apache.org/jira/browse/HDFS-7503
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.2.1, 2.6.0
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>  Labels: 2.6.1-candidate
> Fix For: 1.3.0, 2.6.1
>
> Attachments: HDFS-7503.branch-1.02.patch, HDFS-7503.branch-1.patch, 
> HDFS-7503.trunk.01.patch, HDFS-7503.trunk.02.patch
>
>
> If a large directory is deleted and the namenode is immediately restarted, 
> there are a lot of blocks that do not belong to any file. This results in a 
> log message:
> {code}
> 2014-11-08 03:11:45,584 INFO BlockStateChange 
> (BlockManager.java:processReport(1901)) - BLOCK* processReport: 
> blk_1074250282_509532 on 172.31.44.17:1019 size 6 does not belong to any file.
> {code}
> This log is printed while the FSNamesystem lock is held. This can cause the 
> namenode to take a long time to come out of safemode.
> One solution is to downgrade the logging level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7575) Upgrade should generate a unique storage ID for each volume

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7575:
--
Labels: 2.6.1-candidate  (was: )

> Upgrade should generate a unique storage ID for each volume
> ---
>
> Key: HDFS-7575
> URL: https://issues.apache.org/jira/browse/HDFS-7575
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.4.0, 2.5.0, 2.6.0
>Reporter: Lars Francke
>Assignee: Arpit Agarwal
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0
>
> Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
> HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
> HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
> testUpgrade22via24GeneratesStorageIDs.tgz, 
> testUpgradeFrom22GeneratesStorageIDs.tgz, 
> testUpgradeFrom24PreservesStorageId.tgz
>
>
> Before HDFS-2832 each DataNode would have a unique storageId which included 
> its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
> storage directory which is just a random UUID.
> They send reports per storage directory in their heartbeats. This heartbeat 
> is processed on the NameNode in the 
> {{DatanodeDescriptor#updateHeartbeatState}} method. Pre-HDFS-2832 this would 
> just store the information per Datanode. After the patch, though, each 
> DataNode can have multiple different storages, so it's stored in a map keyed 
> by the storage Id.
> This works fine for all clusters that have been installed post HDFS-2832 as 
> they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
> different keys. On each Heartbeat the Map is searched and updated 
> ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
> {code:title=DatanodeStorageInfo}
>   void updateState(StorageReport r) {
> capacity = r.getCapacity();
> dfsUsed = r.getDfsUsed();
> remaining = r.getRemaining();
> blockPoolUsed = r.getBlockPoolUsed();
>   }
> {code}
> On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
> storage Id has not been rewritten (at least not on the four clusters I 
> checked), so each directory will have the exact same storageId. That means 
> there'll be only a single entry in the {{storageMap}} and it'll be 
> overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
> in the {{updateState}} method above: it just assigns the capacity from the 
> received report, when it should probably sum the values up per received 
> heartbeat.
> The Balancer seems to be one of the only things that actually uses this 
> information, so it now considers the utilization of a random drive per 
> DataNode for balancing purposes.
> Things get even worse when a drive has been added or replaced, as this will 
> now get a new storage Id, so there'll be two entries in the storageMap. As 
> new drives are usually empty, it skews the balancer's decision in a way that 
> this node will never be considered over-utilized.
> Another problem is that old StorageReports are never removed from the 
> storageMap. So if I replace a drive and it gets a new storage Id, the old one 
> will still be in place and used for all calculations by the Balancer until a 
> restart of the NameNode.
> I can try providing a patch that does the following:
> * Instead of using a Map, I could just store the array we receive, or 
> instead of storing an array, sum up the values for reports with the same Id
> * On each heartbeat, clear the map (so we know we have up-to-date information)
> Does that sound sensible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7579) Improve log reporting during block report rpc failure

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7579:
--
Labels: 2.6.1-candidate supportability  (was: supportability)

> Improve log reporting during block report rpc failure
> -
>
> Key: HDFS-7579
> URL: https://issues.apache.org/jira/browse/HDFS-7579
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.7.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
>  Labels: 2.6.1-candidate, supportability
> Fix For: 2.7.0
>
> Attachments: HDFS-7579.000.patch, HDFS-7579.001.patch
>
>
> During block reporting, if the block report RPC fails, for example because it 
> exceeded the max RPC length, we should still produce some sort of LOG.info 
> output to help with debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7596) NameNode should prune dead storages from storageMap

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7596:
--
Labels: 2.6.1-candidate  (was: )

> NameNode should prune dead storages from storageMap
> ---
>
> Key: HDFS-7596
> URL: https://issues.apache.org/jira/browse/HDFS-7596
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0
>
> Attachments: HDFS-7596.01.patch, HDFS-7596.02.patch
>
>
> The NameNode must be able to prune storages that are no longer reported by 
> the DataNode and that have no blocks associated. These stale storages can 
> skew the balancer behavior.
> Detailed discussion on HDFS-7575.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7733) NFS: readdir/readdirplus return null directory attribute on failure

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7733:
--
Labels: 2.6.1-candidate  (was: )

> NFS: readdir/readdirplus return null directory attribute on failure
> ---
>
> Key: HDFS-7733
> URL: https://issues.apache.org/jira/browse/HDFS-7733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1
>
> Attachments: HDFS-7733.01.patch
>
>
> NFS readdir and readdirplus operations return a null directory attribute on 
> some failure paths. This causes clients to get a 'Stale file handle' error, 
> which can only be fixed by unmounting and remounting the share.
> The issue can be reproduced by running 'ls' against a large directory which 
> is being actively modified, triggering the 'cookie mismatch' failure path.
> {code}
> } else {
>   LOG.error("cookieverf mismatch. request cookieverf: " + cookieVerf
>   + " dir cookieverf: " + dirStatus.getModificationTime());
>   return new READDIRPLUS3Response(Nfs3Status.NFS3ERR_BAD_COOKIE);
> }
> {code}
> Thanks to [~brandonli] for catching the issue.
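
A minimal sketch of the general shape of the fix, with hypothetical 
simplified types in place of the real NFS3 response classes: the error 
response carries the directory's current attributes instead of null.

{code}
// Hypothetical simplified types; the real Nfs3 classes carry more state.
class DirAttr {
    final long modificationTime;
    DirAttr(long modificationTime) {
        this.modificationTime = modificationTime;
    }
}

class ReaddirPlusResponse {
    final int status;
    final DirAttr postOpDirAttr;   // must not be null, even on failure
    ReaddirPlusResponse(int status, DirAttr postOpDirAttr) {
        this.status = status;
        this.postOpDirAttr = postOpDirAttr;
    }
}

class ReaddirHandler {
    static final int NFS3ERR_BAD_COOKIE = 10003;

    ReaddirPlusResponse onCookieMismatch(DirAttr currentDirAttr) {
        // Returning the directory attributes with the error lets the client
        // recover instead of seeing a permanent 'Stale file handle'.
        return new ReaddirPlusResponse(NFS3ERR_BAD_COOKIE, currentDirAttr);
    }
}
{code}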



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7885) Datanode should not trust the generation stamp provided by client

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7885:
--
Labels: 2.6.1-candidate  (was: )

> Datanode should not trust the generation stamp provided by client
> -
>
> Key: HDFS-7885
> URL: https://issues.apache.org/jira/browse/HDFS-7885
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.2.0
>Reporter: vitthal (Suhas) Gogate
>Assignee: Tsz Wo Nicholas Sze
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0
>
> Attachments: h7885_20150305.patch, h7885_20150306.patch
>
>
> Datanode should not trust the generation stamp provided by the client, 
> since it is prefetched and buffered in the client, and a concurrent append 
> may have increased it.
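
A minimal sketch of the validation idea with hypothetical names: the 
DataNode compares the client-supplied stamp against the stamp of its stored 
replica and rejects stale values.

{code}
import java.io.IOException;

class GenStampValidator {
    // Hypothetical lookup of the generation stamp the DataNode has on disk.
    interface ReplicaStore {
        long getStoredGenerationStamp(long blockId) throws IOException;
    }

    void checkGenerationStamp(ReplicaStore store, long blockId,
            long clientGenStamp) throws IOException {
        long stored = store.getStoredGenerationStamp(blockId);
        // A concurrent append may have bumped the stamp after the client
        // prefetched it, so the stored value is authoritative.
        if (clientGenStamp < stored) {
            throw new IOException("Generation stamp " + clientGenStamp
                + " for block " + blockId + " is older than stored stamp "
                + stored);
        }
    }
}
{code}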



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7831) Fix the starting index and end condition of the loop in FileDiffList.findEarlierSnapshotBlocks()

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7831:
--
Labels: 2.6.1-candidate  (was: )

> Fix the starting index and end condition of the loop in 
> FileDiffList.findEarlierSnapshotBlocks()
> 
>
> Key: HDFS-7831
> URL: https://issues.apache.org/jira/browse/HDFS-7831
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0
>
> Attachments: HDFS-7831-01.patch
>
>
> Currently the loop in {{FileDiffList.findEarlierSnapshotBlocks()}} starts 
> from {{insertPoint + 1}}. It should start from {{insertPoint - 1}}. As noted 
> in [Jing's 
> comment|https://issues.apache.org/jira/browse/HDFS-7056?focusedCommentId=14333864&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14333864].
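
A minimal sketch of the corrected loop over a hypothetical diff list: 
iterate backwards from {{insertPoint - 1}}, so earlier snapshots are 
actually visited.

{code}
import java.util.List;

class EarlierSnapshotSearch {
    // Hypothetical stand-in for a FileDiff holding snapshot-specific blocks.
    static class Diff {
        final long[] blocks;   // null when this diff recorded no blocks
        Diff(long[] blocks) { this.blocks = blocks; }
    }

    // Walk from the diff just before the insertion point back towards the
    // oldest snapshot; starting at insertPoint + 1 (the old bug) would scan
    // later snapshots instead of earlier ones.
    static long[] findEarlierSnapshotBlocks(List<Diff> diffs,
            int insertPoint) {
        for (int i = insertPoint - 1; i >= 0; i--) {
            long[] blocks = diffs.get(i).blocks;
            if (blocks != null) {
                return blocks;
            }
        }
        return null;
    }
}
{code}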



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8127) NameNode Failover during HA upgrade can cause DataNode to finalize upgrade

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8127:
--
Labels: 2.6.1-candidate  (was: )

> NameNode Failover during HA upgrade can cause DataNode to finalize upgrade
> --
>
> Key: HDFS-8127
> URL: https://issues.apache.org/jira/browse/HDFS-8127
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 2.4.0
>Reporter: Jing Zhao
>Assignee: Jing Zhao
>Priority: Blocker
>  Labels: 2.6.1-candidate
> Fix For: 2.7.1
>
> Attachments: HDFS-8127.000.patch, HDFS-8127.001.patch
>
>
> Currently for HA upgrade (enabled by HDFS-5138), we use {{-bootstrapStandby}} 
> to initialize the standby NameNode. The standby NameNode does not have the 
> {{previous}} directory, so it does not know that the cluster is in the 
> upgrade state. If NN failover happens, then in response to block reports 
> the new ANN will tell DNs to finalize the upgrade, making it impossible to 
> roll back.
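
A minimal sketch of the guard this implies, with hypothetical names; in the 
real fix the bootstrapped standby must learn that an upgrade is pending 
rather than assuming there is none:

{code}
class FinalizeGuard {
    // Hypothetical view of NameNode upgrade state.
    interface UpgradeState {
        boolean isUpgradeInProgress();
    }

    boolean shouldTellDataNodeToFinalize(UpgradeState state) {
        // Never finalize as a side effect of a block report while an
        // upgrade may still need to be rolled back.
        return !state.isUpgradeInProgress();
    }
}
{code}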



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7960) The full block report should prune zombie storages even if they're not empty

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7960:
--
Labels: 2.6.1-candidate  (was: )

> The full block report should prune zombie storages even if they're not empty
> 
>
> Key: HDFS-7960
> URL: https://issues.apache.org/jira/browse/HDFS-7960
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Colin Patrick McCabe
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0
>
> Attachments: HDFS-7960.002.patch, HDFS-7960.003.patch, 
> HDFS-7960.004.patch, HDFS-7960.005.patch, HDFS-7960.006.patch, 
> HDFS-7960.007.patch, HDFS-7960.008.patch
>
>
> The full block report should prune zombie storages even if they're not empty. 
>  We have seen cases in production where zombie storages have not been pruned 
> subsequent to HDFS-7575.  This could arise any time the NameNode thinks there 
> is a block in some old storage which is actually not there.  In this case, 
> the block will not show up in the "new" storage (once old is renamed to new) 
> and the old storage will linger forever as a zombie, even with the HDFS-7596 
> fix applied.  This also happens with datanode hotplug, when a drive is 
> removed.  In this case, an entire storage (volume) goes away but the blocks 
> do not show up in another storage on the same datanode.
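
A minimal sketch of the stricter pruning policy, using simplified types: any 
storage missing from the full report is dropped even if the NameNode still 
attributes blocks to it, since those locations are stale (e.g. the drive was 
renamed or hot-removed).

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ZombieStoragePruner {
    // storage ID -> block IDs the NameNode currently maps to that storage
    private final Map<String, Set<Long>> blocksByStorage = new HashMap<>();

    // After processing a full block report, remove every storage the report
    // did not mention, even if it still appears to hold blocks; waiting for
    // the block count to reach zero (the pre-fix behavior) leaves zombies
    // behind forever.
    void pruneZombies(Set<String> storagesInFullReport) {
        blocksByStorage.keySet().removeIf(
            storageId -> !storagesInFullReport.contains(storageId));
    }
}
{code}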



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8072) Reserved RBW space is not released if client terminates while writing block

2015-07-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8072:
--
Labels: 2.6.1-candidate  (was: )

> Reserved RBW space is not released if client terminates while writing block
> ---
>
> Key: HDFS-8072
> URL: https://issues.apache.org/jira/browse/HDFS-8072
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0
>
> Attachments: HDFS-8072.01.patch, HDFS-8072.02.patch
>
>
> The DataNode reserves space for a full block when creating an RBW block 
> (introduced in HDFS-6898).
> The reserved space is released incrementally as data is written to disk and 
> fully when the block is finalized. However, if the client process 
> terminates unexpectedly mid-write, the reserved space is not released until 
> the DN is restarted.
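
A minimal sketch of the accounting with hypothetical names: reserved space 
is released incrementally as bytes land on disk, and any remainder is 
returned on teardown even when the client dies mid-write.

{code}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical simplified volume-level accounting.
class RbwSpaceAccounting {
    private final AtomicLong reservedForRbw = new AtomicLong();

    void reserve(long bytes) { reservedForRbw.addAndGet(bytes); }
    void release(long bytes) { reservedForRbw.addAndGet(-bytes); }
}

class RbwBlockWriter implements AutoCloseable {
    private final RbwSpaceAccounting volume;
    private long bytesReserved;

    RbwBlockWriter(RbwSpaceAccounting volume, long fullBlockSize) {
        this.volume = volume;
        this.bytesReserved = fullBlockSize;
        volume.reserve(fullBlockSize);   // reserve a full block up front
    }

    // Incremental release as bytes reach disk (the normal path).
    synchronized void onBytesWritten(long n) {
        long release = Math.min(n, bytesReserved);
        bytesReserved -= release;
        volume.release(release);
    }

    // Teardown path: runs even when the client terminates mid-write, so
    // any still-reserved space is returned without a DataNode restart.
    @Override
    public synchronized void close() {
        volume.release(bytesReserved);
        bytesReserved = 0;
    }
}
{code}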



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7871) NameNodeEditLogRoller can keep printing "Swallowing exception" message

2015-07-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7871:
--
Labels: 2.6.1-candidate  (was: )

> NameNodeEditLogRoller can keep printing "Swallowing exception" message
> --
>
> Key: HDFS-7871
> URL: https://issues.apache.org/jira/browse/HDFS-7871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Jing Zhao
>Assignee: Jing Zhao
>Priority: Critical
>  Labels: 2.6.1-candidate
> Fix For: 2.7.0
>
> Attachments: HDFS-7871.000.patch
>
>
> In one of our customers' clusters, we saw the NameNode keep printing the 
> following error message and thus get stuck for several minutes:
> {code}
> ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Swallowing 
> exception in NameNodeEditLogRoller:
> java.lang.IllegalStateException: Bad state: BETWEEN_LOG_SEGMENTS
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:172)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.getCurSegmentTxId(FSEditLog.java:493)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeEditLogRoller.run(FSNamesystem.java:4358)
> at java.lang.Thread.run(Thread.java:745)
> {code}
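
A minimal sketch of the defensive pattern with hypothetical names: check 
whether a segment is open before querying it, so the roller skips a cycle 
instead of spinning on the same IllegalStateException.

{code}
class EditLogRoller implements Runnable {
    // Hypothetical stand-in for the FSEditLog calls involved.
    interface EditLog {
        boolean isSegmentOpen();
        long getCurSegmentTxId();   // throws IllegalStateException otherwise
        long getLastWrittenTxId();
        void rollEditLog();
    }

    private final EditLog editLog;
    private final long rollThreshold;
    private final long sleepIntervalMs;
    private volatile boolean shouldRun = true;

    EditLogRoller(EditLog editLog, long rollThreshold, long sleepIntervalMs) {
        this.editLog = editLog;
        this.rollThreshold = rollThreshold;
        this.sleepIntervalMs = sleepIntervalMs;
    }

    @Override
    public void run() {
        while (shouldRun) {
            try {
                // Guard: between log segments there is no current segment,
                // so skip this cycle instead of throwing and retrying hot.
                if (editLog.isSegmentOpen()
                    && editLog.getLastWrittenTxId()
                        - editLog.getCurSegmentTxId() >= rollThreshold) {
                    editLog.rollEditLog();
                }
                Thread.sleep(sleepIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                shouldRun = false;
            }
        }
    }
}
{code}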



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

