[jira] [Commented] (HDFS-15419) RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140277#comment-17140277 ]

Yuxuan Wang commented on HDFS-15419:
------------------------------------

[~ayushtkn] Thanks for your reply. IIRC, the router currently retries not only when it catches StandbyException, but also on some other exceptions like ConnectionTimeoutException. IMO, we can at least improve the retry policy in the router. And I think adding more retries is not a good fit for this jira.

> RBF: Router should retry communicate with NN when cluster is unavailable
> using configurable time interval
> ------------------------------------------------------------------------
>
>                 Key: HDFS-15419
>                 URL: https://issues.apache.org/jira/browse/HDFS-15419
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: configuration, hdfs-client, rbf
>            Reporter: bhji123
>            Priority: Major
>
> When the cluster is unavailable, router -> namenode communication retries
> only once, without any time interval, which is not reasonable.
> For example, in my company, which has several HDFS clusters with more than
> 1000 nodes, we have encountered this problem. In some cases the cluster
> becomes unavailable briefly, for about 10 or 30 seconds; during that time
> almost all RPC requests to the router fail because the router retries only
> once without a time interval.
> It would be better to enhance the router retry strategy: retry
> communication with the NN using a configurable time interval and maximum
> number of retries.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
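The configurable retry being proposed here can be sketched generically as follows. This is an illustration only, not the actual Hadoop RBF code: the method name, constants, and config defaults are assumptions, and the real patch would presumably build on Hadoop's existing RetryPolicy machinery.

```java
import java.util.concurrent.Callable;

/** Minimal sketch of "retry with configurable interval and max retry count". */
public class RouterRetrySketch {

    // Stand-ins for the proposed configuration values; the actual
    // config key names and defaults are not decided in this jira.
    static final int MAX_RETRIES = 3;
    static final long RETRY_INTERVAL_MS = 1000L;

    static <T> T callWithRetries(Callable<T> rpc, int maxRetries,
                                 long intervalMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return rpc.call();            // forward the call to the NN
            } catch (Exception e) {
                last = e;                     // remember the failure
                if (attempt < maxRetries) {
                    Thread.sleep(intervalMs); // wait before the next attempt
                }
            }
        }
        throw last;                           // all attempts exhausted
    }
}
```

With an interval between attempts, a brief 10-30 second unavailability window (as described in the issue) can be ridden out instead of failing every RPC immediately; the trade-off Ayush raises below is that the client's own timeout keeps running while the router retries.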
[jira] [Comment Edited] (HDFS-15419) RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140258#comment-17140258 ]

Ayush Saxena edited comment on HDFS-15419 at 6/19/20, 6:48 AM:
---------------------------------------------------------------

The present code does failover because the router maintains the active/standby state of the namenodes: if a namenode's role changes from what is stored in the Router, the router fails over and updates that state. In that respect the present code seems OK, and removing it isn't required. If we removed it, then when a failover happens the router would keep rejecting calls based on the stale cached state until the heartbeat updates it.

The present retry logic just ensures that if there is an active namenode, it gets the call. If the router can't find one, it doesn't hold the call; the client can then decide whether to retry or not. I am not sure, but if, as proposed here, the router does a full retry like a normal client, then in bad situations the actual client may time out: from its point of view it sent just one call, which is stuck at the server, and it won't be aware that the router is retrying against different namenodes.

Well, IIRC we even added retry-related logic to the router recently for this purpose: among all the exceptions received from the several namespaces, if one exception is retriable, only that one gets propagated, so that the client can retry.
[jira] [Commented] (HDFS-15419) RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140258#comment-17140258 ]

Ayush Saxena commented on HDFS-15419:
-------------------------------------

The present code does failover because the router maintains the active/standby state of the namenodes: if a namenode's role changes from what is stored in the Router, the router fails over and updates that state. In that respect the present code seems OK, and removing it isn't required. If we removed it, then when a failover happens the router would keep rejecting calls based on the stale cached state until the heartbeat updates it.

The present retry logic just ensures that if there is an active namenode, it gets the call. If the router can't find one, it doesn't hold the call; the client can then decide whether to retry or not. I am not sure, but if, as proposed here, the router does a full retry like a normal client, then in bad situations the actual client may time out: from its point of view it sent just one call, which is stuck at the server, and it won't be aware that the router is retrying against different namenodes.
[jira] [Commented] (HDFS-15419) RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140249#comment-17140249 ]

Yuxuan Wang commented on HDFS-15419:
------------------------------------

[~bhji123] Well, I rather agree with [~ayushtkn], and I think we should remove the retry code currently in the router rather than add more retries to it. I see [~elgoiri] reviewed the PR. What do you think of Saxena's comment?
[jira] [Commented] (HDFS-15419) RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140242#comment-17140242 ]

bhji123 commented on HDFS-15419:
--------------------------------

Hi, Yuxuan. In this case, if clients time out and the NN is still unavailable, then clients will retry. The difference is that the router will be more reliable, especially when clients are not configured appropriately.
[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140240#comment-17140240 ]

Jinglun commented on HDFS-15410:
--------------------------------

Hi [~elgoiri], thanks for your nice comments! Refer to fedbalance-site.xml in the doc (HDFS-15374 PR). Uploaded v02 fixing the typo: `fedbalance-site.xml` was mistyped as `distcp-site.xml`.

> Add separated config file fedbalance-default.xml for fedbalance tool
> --------------------------------------------------------------------
>
>                 Key: HDFS-15410
>                 URL: https://issues.apache.org/jira/browse/HDFS-15410
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Jinglun
>            Assignee: Jinglun
>            Priority: Major
>         Attachments: HDFS-15410.001.patch, HDFS-15410.002.patch
>
> Add a separated config file named fedbalance-default.xml for fedbalance tool
> configs. It's like the distcp-default.xml for the distcp tool.
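For readers unfamiliar with the *-default.xml convention, a tool-specific defaults file like the one this issue adds would follow the standard Hadoop configuration layout. The property name below is a placeholder for illustration only; the actual keys are defined in the patch:

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of fedbalance-default.xml, following the same
     pattern as distcp-default.xml. The property name and value here are
     illustrative placeholders, not the keys from the actual patch. -->
<configuration>
  <property>
    <name>hdfs.fedbalance.example.property</name>
    <value>10</value>
    <description>Placeholder entry showing the file's structure.</description>
  </property>
</configuration>
```

Such a file ships with tool defaults, and deployments override individual keys in a corresponding fedbalance-site.xml.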
[jira] [Updated] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinglun updated HDFS-15410:
---------------------------
    Attachment: HDFS-15410.002.patch
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140176#comment-17140176 ]

Hadoop QA commented on HDFS-15404:
----------------------------------

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 2m 53s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 0m 32s | Maven dependency ordering for branch |
| +1 | mvninstall | 25m 5s | trunk passed |
| +1 | compile | 24m 5s | trunk passed |
| +1 | checkstyle | 4m 6s | trunk passed |
| +1 | mvnsite | 3m 47s | trunk passed |
| +1 | shadedclient | 27m 43s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 2m 8s | trunk passed |
| 0 | spotbugs | 3m 45s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 6m 28s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 24s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 1s | the patch passed |
| +1 | compile | 17m 16s | the patch passed |
| +1 | javac | 17m 16s | the patch passed |
| -0 | checkstyle | 2m 52s | root: The patch generated 4 new + 23 unchanged - 0 fixed = 27 total (was 23) |
| +1 | mvnsite | 2m 47s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 16s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 1m 47s | the patch passed |
| +1 | findbugs | 5m 34s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 9m 22s | hadoop-common in the patch failed. |
| -1 | unit | 118m 35s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 54s | The patch does not generate ASF License warnings. |
| | | 268m 17s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.ha.TestFailoverController |
| | hadoop.ha.TestShellCommandFencer |
| | hadoop.ha.TestNodeFencer |
| | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
| | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
| | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
| | hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics |
| | hadoop.hdfs.tools.TestDFSHAAdminMiniCluster |
| | hadoop.hdfs.TestReco
[jira] [Commented] (HDFS-14941) Potential editlog race condition can cause corrupted file
[ https://issues.apache.org/jira/browse/HDFS-14941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140156#comment-17140156 ]

Kihwal Lee commented on HDFS-14941:
-----------------------------------

This change causes an incremental block report (IBR) leak. The last set of IBRs from an append is always considered to be from the future and re-queued. Unless the file is appended again, those reports won't get processed, and then the IBRs for the latest append will leak in turn. If this happens during a startup safe mode, the standby NN can never leave the safe mode on its own. I didn't test with truncate, but it might happen with truncate too.

It is easy to see the last set of IBRs getting re-queued after enabling debug logging in BlockManager. We first thought there was something wrong with the new safe mode implementation, but found that the baseline number of datanode pending messages was growing, which does not happen in 2.8. After enabling debug logging, we could see the IBRs for append getting re-queued rather than processed. Reverting this change fixed the issue.

Regarding the original corruption issue you saw, we have seen something very similar too. After a failover, the NN suddenly had missing blocks due to corruption, but the corruption reason was recorded as "size mismatch" in our case. Of course, the actual data was fine. We haven't seen it happen again after the fix, but it is rare anyway. The main part of the fix we did is:

{code}
@@ -2578,10 +2578,7 @@ private BlockInfo processReportedBlock(
         // If the block is an out-of-date generation stamp or state,
         // but we're the standby, we shouldn't treat it as corrupt,
         // but instead just queue it for later processing.
-        // TODO: Pretty confident this should be s/storedBlock/block below,
-        // since we should be postponing the info of the reported block, not
-        // the stored block. See HDFS-6289 for more context.
-        queueReportedBlock(storageInfo, storedBlock, reportedState,
+        queueReportedBlock(storageInfo, block, reportedState,
             QUEUE_REASON_CORRUPT_STATE);
       } else {
         toCorrupt.add(c);
{code}

We wanted to get more run time before reporting to the community. This is the only place where a wrong size is queued with an IBR in append or truncate, because it queues using the stored block, not the reported one. I wonder why it was left like that all these years, despite the suspicious comment.

> Potential editlog race condition can cause corrupted file
> ---------------------------------------------------------
>
>                 Key: HDFS-14941
>                 URL: https://issues.apache.org/jira/browse/HDFS-14941
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>            Priority: Major
>              Labels: ha
>             Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
>         Attachments: HDFS-14941.001.patch, HDFS-14941.002.patch,
> HDFS-14941.003.patch, HDFS-14941.004.patch, HDFS-14941.005.patch,
> HDFS-14941.006.patch
>
> Recently we encountered an issue that, after a failover, the NameNode
> complains about corrupted files / missing blocks. The blocks did recover
> after full block reports, so the blocks are not actually missing. After
> further investigation, we believe this is what happened:
> First of all, on the SbN it is possible to receive block reports before the
> corresponding edit tailing has happened, in which case the SbN postpones
> processing the DN block report, handled by the guarding logic below:
> {code:java}
> if (shouldPostponeBlocksFromFuture &&
>     namesystem.isGenStampInFuture(iblk)) {
>   queueReportedBlock(storageInfo, iblk, reportedState,
>       QUEUE_REASON_FUTURE_GENSTAMP);
>   continue;
> }
> {code}
> Basically, if a reported block has a future generation stamp, the DN report
> gets re-queued.
> However, in {{FSNamesystem#storeAllocatedBlock}}, we have the following code:
> {code:java}
> // allocate new block, record block locations in INode.
> newBlock = createNewBlock();
> INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
> saveAllocatedBlock(src, inodesInPath, newBlock, targets);
> persistNewBlock(src, pendingFile);
> offset = pendingFile.computeFileSize();
> {code}
> The line {{newBlock = createNewBlock();}} logs an edit entry
> {{OP_SET_GENSTAMP_V2}} to bump the generation stamp on the Standby, while
> the following line {{persistNewBlock(src, pendingFile);}} logs another edit
> entry {{OP_ADD_BLOCK}} to actually add the block on the Standby.
> The race condition is that the Standby may have just processed
> {{OP_SET_GENSTAMP_V2}} but not yet {{OP_ADD_BLOCK}} (if they happen to be
> in different segments). Now a block report with the new generation stamp
> comes in.
> Since the genstamp bump has
[jira] [Updated] (HDFS-15416) DataStorage#addStorageLocations() should add more reasonable information verification.
[ https://issues.apache.org/jira/browse/HDFS-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jianghua zhu updated HDFS-15416:
--------------------------------
    Summary: DataStorage#addStorageLocations() should add more reasonable information verification.  (was: The addStorageLocations() method in the DataStorage class is not perfect.)

> DataStorage#addStorageLocations() should add more reasonable information
> verification.
> ------------------------------------------------------------------------
>
>                 Key: HDFS-15416
>                 URL: https://issues.apache.org/jira/browse/HDFS-15416
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: jianghua zhu
>            Assignee: jianghua zhu
>            Priority: Major
>         Attachments: HDFS-15416.patch
>
> successLocations is a list; when its size is 0, there is no need to execute
> loadBlockPoolSliceStorage() again.
> code:
> {code:java}
> try {
>   final List<StorageLocation> successLocations = loadDataStorage(
>       datanode, nsInfo, dataDirs, startOpt, executor);
>   return loadBlockPoolSliceStorage(
>       datanode, nsInfo, successLocations, startOpt, executor);
> } finally {
>   executor.shutdown();
> }
> {code}
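The short-circuit the description asks for, skipping loadBlockPoolSliceStorage() when nothing was loaded successfully, can be sketched with stand-in types. This is an illustration of the suggested guard, not the actual DataStorage code:

```java
import java.util.Collections;
import java.util.List;

public class AddStorageSketch {
    // Stand-in for DataStorage#loadBlockPoolSliceStorage(); counts
    // invocations so the short-circuit is observable.
    static int bpLoadCalls = 0;

    static List<String> loadBlockPoolSliceStorage(List<String> locations) {
        bpLoadCalls++;
        return locations;
    }

    static List<String> addStorageLocations(List<String> successLocations) {
        if (successLocations.isEmpty()) {
            // Nothing was loaded successfully; skip the second pass entirely.
            return Collections.emptyList();
        }
        return loadBlockPoolSliceStorage(successLocations);
    }
}
```

With the guard in place, an empty success list returns immediately and the block-pool slice loading pass is never entered.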
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140052#comment-17140052 ]

Chen Liang commented on HDFS-15404:
-----------------------------------

Upload v002 patch to fix the bug that caused the failed tests. The bug was that parseArgs should allow a cmd having only the command, in which case both src and dst will execute the same command/script.

> ShellCommandFencer should expose info about source
> --------------------------------------------------
>
>                 Key: HDFS-15404
>                 URL: https://issues.apache.org/jira/browse/HDFS-15404
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>            Priority: Major
>         Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment
> variables only about the fencing target, i.e. the $target_* variables
> mentioned in this [document
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
> But only the fencing target variables are exposed. Sometimes it is useful
> to expose info about the fencing source node as well. One use case is that
> it would allow the source and target nodes to identify themselves
> separately and run different commands/scripts.
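The source/target distinction can be illustrated with the environment variables a fence script would see. Note the hedge: $target_host belongs to the documented $target_* family, but the $source_host name here is an assumption based on this proposal, not a documented variable:

```shell
# Simulate the environment the fencer would export before invoking the
# fence script, then use both sides in the fencing command line.
# $target_* is documented; $source_* is assumed from this JIRA's proposal.
target_host=nn2.example.com
source_host=nn1.example.com

fence_msg="fencing ${target_host}, initiated by ${source_host}"
echo "$fence_msg"
```

A script that knows both ends can, for example, log which node initiated the fence or pick a different action when it is itself the source.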
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-15404:
------------------------------
    Attachment: HDFS-15404.002.patch
[jira] [Commented] (HDFS-15378) TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on trunk
[ https://issues.apache.org/jira/browse/HDFS-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139990#comment-17139990 ]

Íñigo Goiri commented on HDFS-15378:
------------------------------------

+1 on [^HDFS-15378.001.patch].

> TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on
> trunk
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-15378
>                 URL: https://issues.apache.org/jira/browse/HDFS-15378
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: hemanthboyina
>            Priority: Major
>         Attachments: HDFS-15378.001.patch
>
> [https://builds.apache.org/job/PreCommit-HDFS-Build/29377/#showFailuresLink]
> [https://builds.apache.org/job/PreCommit-HDFS-Build/29368/]
[jira] [Commented] (HDFS-14546) Document block placement policies
[ https://issues.apache.org/jira/browse/HDFS-14546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139991#comment-17139991 ]

Íñigo Goiri commented on HDFS-14546:
------------------------------------

+1 on [^HDFS-14546-09.patch].

> Document block placement policies
> ---------------------------------
>
>                 Key: HDFS-14546
>                 URL: https://issues.apache.org/jira/browse/HDFS-14546
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Íñigo Goiri
>            Assignee: Amithsha
>            Priority: Major
>              Labels: documentation
>         Attachments: HDFS-14546-01.patch, HDFS-14546-02.patch,
> HDFS-14546-03.patch, HDFS-14546-04.patch, HDFS-14546-05.patch,
> HDFS-14546-06.patch, HDFS-14546-07.patch, HDFS-14546-08.patch,
> HDFS-14546-09.patch, HdfsDesign.patch
>
> Currently, all the documentation refers to the default block placement
> policy. However, over time there have been new policies:
> * BlockPlacementPolicyRackFaultTolerant (HDFS-7891)
> * BlockPlacementPolicyWithNodeGroup (HDFS-3601)
> * BlockPlacementPolicyWithUpgradeDomain (HDFS-9006)
> We should update the documentation to refer to them, explaining their
> particularities and probably how to set up each one of them.
[jira] [Commented] (HDFS-15416) The addStorageLocations() method in the DataStorage class is not perfect.
[ https://issues.apache.org/jira/browse/HDFS-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139988#comment-17139988 ]

Íñigo Goiri commented on HDFS-15416:
------------------------------------

Please update the title to be a little more specific.
[jira] [Updated] (HDFS-15418) ViewFileSystemOverloadScheme should represent mount links as non symlinks
[ https://issues.apache.org/jira/browse/HDFS-15418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated HDFS-15418:
---------------------------------------
    Status: Patch Available  (was: Open)

> ViewFileSystemOverloadScheme should represent mount links as non symlinks
> -------------------------------------------------------------------------
>
>                 Key: HDFS-15418
>                 URL: https://issues.apache.org/jira/browse/HDFS-15418
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>            Priority: Major
>
> Currently ViewFileSystemOverloadScheme uses the ViewFileSystem default
> behavior: ViewFS always represents mount links as symlinks. Since
> ViewFSOverloadScheme can be used with any scheme, and that scheme's
> filesystem may not have symlinks, the ViewFs symlink behavior can be
> confusing.
> So, here I propose to represent mount links as non-symlinks in
> ViewFSOverloadScheme.
[jira] [Comment Edited] (HDFS-15418) ViewFileSystemOverloadScheme should represent mount links as non symlinks
[ https://issues.apache.org/jira/browse/HDFS-15418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139983#comment-17139983 ] Uma Maheswara Rao G edited comment on HDFS-15418 at 6/18/20, 8:46 PM: -- Updated PR for review! By default ViewFileSystem represents mount links as symlinks. Many deployments have not really developed symlink-aware applications, as ListStatus behaves a little differently when we have symlinks. With OverloadScheme, if an application is built on another fs with no symlink handling, then the application sees some different behaviors. Ex: HADOOP-17024 However, HADOOP-17029 attempted to fix some of those incompatibilities. But changing existing behaviors would create incompatibility issues. So, the idea is to introduce an advanced config to disable the symlink assumption in ViewFileSystem#listStatus. By default, mount links are represented as symlinks in ViewFileSystem. If one wants to disable that, please set fs.viewfs.mount.links.as.symlinks to false. In ViewFileSystemOverloadScheme, by default it's false, as we tend to work like any other HCFS filesystem and many of them might not have symlinks. If one wants to see them the same as in ViewFileSystem, please set fs.viewfs.mount.links.as.symlinks to true. This is an advanced and non-advertised property. CC: [~abhishekd] please check if this works fine in your scenarios as this is slightly modified behavior from HADOOP-17029. was (Author: umamaheswararao): Updated PR for review! By default ViewFileSystem represents mount links as symlinks. Many deployments have not really developed symlink-aware applications, as ListStatus behaves a little differently when we have symlinks. With OverloadScheme, if an application is built on another fs with no symlink handling, then the application sees some different behaviors. Ex: HADOOP-17024 However, HADOOP-17029 attempted to fix some of those incompatibilities. But changing existing behaviors would create incompatibility issues.
So, the idea is to introduce an advanced config to disable the symlink assumption in ViewFileSystem#listStatus. By default, mount links are represented as symlinks in ViewFileSystem. If one wants to disable that, please set fs.viewfs.mount.links.as.symlinks to false. In ViewFileSystemOverloadScheme, by default it's false, as we tend to work like any other HCFS filesystem and many of them might not have symlinks. If one wants to see them the same as in ViewFileSystem, please set fs.viewfs.mount.links.as.symlinks to true. This is an advanced and non-advertised property. > ViewFileSystemOverloadScheme should represent mount links as non symlinks > - > > Key: HDFS-15418 > URL: https://issues.apache.org/jira/browse/HDFS-15418 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > > Currently ViewFileSystemOverloadScheme uses the ViewFileSystem default behavior: ViewFS always represents mount links as symlinks. Since ViewFSOverloadScheme can be backed by any scheme, and that scheme's fs may not have symlinks, the ViewFs symlink behavior can be confusing. > So, here I propose to represent mount links as non symlinks in > ViewFSOverloadScheme -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15418) ViewFileSystemOverloadScheme should represent mount links as non symlinks
[ https://issues.apache.org/jira/browse/HDFS-15418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139983#comment-17139983 ] Uma Maheswara Rao G commented on HDFS-15418: Updated PR for review! By default ViewFileSystem represents mount links as symlinks. Many deployments have not really developed symlink-aware applications, as ListStatus behaves a little differently when we have symlinks. With OverloadScheme, if an application is built on another fs with no symlink handling, then the application sees some different behaviors. Ex: HADOOP-17024 However, HADOOP-17029 attempted to fix some of those incompatibilities. But changing existing behaviors would create incompatibility issues. So, the idea is to introduce an advanced config to disable the symlink assumption in ViewFileSystem#listStatus. By default, mount links are represented as symlinks in ViewFileSystem. If one wants to disable that, please set fs.viewfs.mount.links.as.symlinks to false. In ViewFileSystemOverloadScheme, by default it's false, as we tend to work like any other HCFS filesystem and many of them might not have symlinks. If one wants to see them the same as in ViewFileSystem, please set fs.viewfs.mount.links.as.symlinks to true. This is an advanced and non-advertised property. > ViewFileSystemOverloadScheme should represent mount links as non symlinks > - > > Key: HDFS-15418 > URL: https://issues.apache.org/jira/browse/HDFS-15418 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > > Currently ViewFileSystemOverloadScheme uses the ViewFileSystem default behavior: ViewFS always represents mount links as symlinks. Since ViewFSOverloadScheme can be backed by any scheme, and that scheme's fs may not have symlinks, the ViewFs symlink behavior can be confusing.
> So, here I propose to represent mount links as non symlinks in > ViewFSOverloadScheme -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
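Based on the property name spelled out in the comment above, the switch is a plain boolean config key; a minimal core-site.xml fragment (assuming the property lands exactly as named in the comment) would look like:

```xml
<!-- Advanced, non-advertised switch discussed in HDFS-15418.
     Per the comment: defaults to true under ViewFileSystem,
     false under ViewFileSystemOverloadScheme. -->
<property>
  <name>fs.viewfs.mount.links.as.symlinks</name>
  <value>false</value>
</property>
```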
[jira] [Updated] (HDFS-15419) RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-15419: --- Summary: RBF: Router should retry communicate with NN when cluster is unavailable using configurable time interval (was: Router should retry communicate with NN when cluster is unavailable using configurable time interval) > RBF: Router should retry communicate with NN when cluster is unavailable > using configurable time interval > - > > Key: HDFS-15419 > URL: https://issues.apache.org/jira/browse/HDFS-15419 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration, hdfs-client, rbf >Reporter: bhji123 >Priority: Major > > When the cluster is unavailable, router -> namenode communication will only retry once, without any time interval, which is not reasonable. > For example, in my company, which has several hdfs clusters with more than 1000 nodes, we have encountered this problem. In some cases, the cluster becomes unavailable briefly, for about 10 or 30 seconds; at the same time, almost all rpc requests to the router fail because the router only retries once without a time interval. > It's better for us to enhance the router retry strategy: retry communication with the NN using a configurable time interval and a maximum retry count. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
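The requested behavior, retrying with a configurable interval and maximum retry count, can be sketched generically as below. This is an illustrative helper, not the Router's actual code path; Hadoop's own org.apache.hadoop.io.retry.RetryPolicies already provides comparable fixed-sleep policies.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Call `call`, retrying up to maxRetries extra times and sleeping
     *  intervalMs between attempts -- the two knobs the issue asks
     *  to make configurable. */
    public static <T> T callWithRetry(Callable<T> call, int maxRetries, long intervalMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                     // remember the failure
                if (attempt < maxRetries) {
                    Thread.sleep(intervalMs); // wait before the next attempt
                }
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulate a cluster that is briefly unavailable for two calls.
        String r = callWithRetry(() -> {
            if (++attempts[0] < 3) throw new RuntimeException("NN unavailable");
            return "ok";
        }, 5, 10);
        System.out.println(r + " after " + attempts[0] + " attempts"); // prints "ok after 3 attempts"
    }
}
```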
[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139961#comment-17139961 ] Íñigo Goiri commented on HDFS-15410: I think we should refer to this in the documentation too. > Add separated config file fedbalance-default.xml for fedbalance tool > > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15410.001.patch > > > Add a separate config file named fedbalance-default.xml for fedbalance tool configs. It's like distcp-default.xml for the distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15417) RBF: Lazy get the datanode report for federation WebHDFS operations
[ https://issues.apache.org/jira/browse/HDFS-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-15417: --- Summary: RBF: Lazy get the datanode report for federation WebHDFS operations (was: Lazy get the datanode report for federation WebHDFS operations) > RBF: Lazy get the datanode report for federation WebHDFS operations > --- > > Key: HDFS-15417 > URL: https://issues.apache.org/jira/browse/HDFS-15417 > Project: Hadoop HDFS > Issue Type: Improvement > Components: federation, rbf, webhdfs >Reporter: Ye Ni >Assignee: Ye Ni >Priority: Minor > > *Why* > For WebHDFS CREATE, OPEN, APPEND and GETFILECHECKSUM operations, the router or namenode needs to get the datanodes where the block is located, then redirect the request to one of the datanodes. > However, this chooseDatanode action in the router is much slower than in the namenode, which directly affects the WebHDFS operations above. > For namenode WebHDFS, it normally takes tens of milliseconds, while the router always takes more than 2 seconds. > *How* > Only get the datanode report when necessary in the router. It is a very expensive operation, and it is where most of the time is spent. > This is only needed when we want to exclude some datanodes or find a random datanode for CREATE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15417) Lazy get the datanode report for federation WebHDFS operations
[ https://issues.apache.org/jira/browse/HDFS-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139554#comment-17139554 ] Chao Sun commented on HDFS-15417: - I think this addresses the same issue as HDFS-15014. Internally we were trying to use the cached DN reports, but those are tied to Router metrics and the implementation is kind of messy. > Lazy get the datanode report for federation WebHDFS operations > -- > > Key: HDFS-15417 > URL: https://issues.apache.org/jira/browse/HDFS-15417 > Project: Hadoop HDFS > Issue Type: Improvement > Components: federation, rbf, webhdfs >Reporter: Ye Ni >Assignee: Ye Ni >Priority: Minor > > *Why* > For WebHDFS CREATE, OPEN, APPEND and GETFILECHECKSUM operations, the router or namenode needs to get the datanodes where the block is located, then redirect the request to one of the datanodes. > However, this chooseDatanode action in the router is much slower than in the namenode, which directly affects the WebHDFS operations above. > For namenode WebHDFS, it normally takes tens of milliseconds, while the router always takes more than 2 seconds. > *How* > Only get the datanode report when necessary in the router. It is a very expensive operation, and it is where most of the time is spent. > This is only needed when we want to exclude some datanodes or find a random datanode for CREATE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
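The "lazy get" idea above, deferring the expensive datanode-report call until a request actually needs it and computing it at most once, can be sketched with a memoizing supplier. This is illustrative only; the actual router change lives in the WebHDFS redirect path, not in a generic wrapper like this.

```java
import java.util.function.Supplier;

public class LazySketch {
    /** Wrap an expensive computation so it runs only on first use
     *  and its result is reused afterwards. */
    public static <T> Supplier<T> lazy(Supplier<T> expensive) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;
            @Override public synchronized T get() {
                if (!computed) {           // first caller pays the cost
                    value = expensive.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Supplier<String> report = lazy(() -> { calls[0]++; return "dn-report"; });
        // Operations that never call get() never pay for the report.
        report.get();
        report.get();
        System.out.println(calls[0]); // prints 1 -- computed only once
    }
}
```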
[jira] [Commented] (HDFS-13965) hadoop.security.kerberos.ticket.cache.path setting is not honored when KMS encryption is enabled.
[ https://issues.apache.org/jira/browse/HDFS-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139552#comment-17139552 ] LOKESKUMAR VIJAYAKUMAR commented on HDFS-13965: --- Hello Team! Can anyone please help here? > hadoop.security.kerberos.ticket.cache.path setting is not honored when KMS > encryption is enabled. > - > > Key: HDFS-13965 > URL: https://issues.apache.org/jira/browse/HDFS-13965 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, kms >Affects Versions: 2.7.3, 2.7.7 >Reporter: LOKESKUMAR VIJAYAKUMAR >Assignee: Kitti Nanasi >Priority: Major > > _We use the *+hadoop.security.kerberos.ticket.cache.path+* setting to provide > a custom kerberos cache path for all hadoop operations to be run as specified > user. But this setting is not honored when KMS encryption is enabled._ > _The below program to read a file works when KMS encryption is not enabled, > but it fails when the KMS encryption is enabled._ > _Looks like *hadoop.security.kerberos.ticket.cache.path* setting is not > honored by *createConnection on KMSClientProvider.java.*_ > > HadoopTest.java (CLASSPATH needs to be set to compile and run) > > import java.io.InputStream; > import java.net.URI; > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.Path; > > public class HadoopTest { > public static int runRead(String[] args) throws Exception{ > if (args.length < 3) { > System.err.println("HadoopTest hadoop_file_path > hadoop_user kerberos_cache"); > return 1; > } > Path inputPath = new Path(args[0]); > Configuration conf = new Configuration(); > URI defaultURI = FileSystem.getDefaultUri(conf); > > conf.set("hadoop.security.kerberos.ticket.cache.path",args[2]); > FileSystem fs = > FileSystem.newInstance(defaultURI,conf,args[1]); > InputStream is = fs.open(inputPath); > byte[] buffer = new byte[4096]; > int nr = is.read(buffer); > while (nr != -1) > { > 
System.out.write(buffer, 0, nr); > nr = is.read(buffer); > } > return 0; > } > public static void main( String[] args ) throws Exception { > int returnCode = HadoopTest.runRead(args); > System.exit(returnCode); > } > } > > > > [root@lstrost3 testhadoop]# pwd > /testhadoop > > [root@lstrost3 testhadoop]# ls > HadoopTest.java > > [root@lstrost3 testhadoop]# export CLASSPATH=`hadoop classpath --glob`:. > > [root@lstrost3 testhadoop]# javac HadoopTest.java > > [root@lstrost3 testhadoop]# java HadoopTest > HadoopTest hadoop_file_path hadoop_user kerberos_cache > > [root@lstrost3 testhadoop]# java HadoopTest /loki/loki.file loki > /tmp/krb5cc_1006 > 18/09/27 23:23:20 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 18/09/27 23:23:21 WARN shortcircuit.DomainSocketFactory: The short-circuit > local reads feature cannot be used because libhadoop cannot be loaded. > Exception in thread "main" java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: *{color:#FF}No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt){color}* > at > {color:#FF}*org.apache.hadoop.crypto.key.kms.KMSClientProvider.createConnection(KMSClientProvider.java:551)*{color} > at > org.apache.hadoop.crypto.key.kms.KMSClientProvider.decryptEncryptedKey(KMSClientProvider.java:831) > at > org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388) > at > org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1393) > at > org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1463) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:333) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327) > at > 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786) > at HadoopTest.runRead(HadoopTest.java:18) > at HadoopTest.main(HadoopTest.j
[jira] [Commented] (HDFS-15420) approx scheduled blocks not reseting over time
[ https://issues.apache.org/jira/browse/HDFS-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139468#comment-17139468 ] hemanthboyina commented on HDFS-15420: -- thanks [~maxmzkr] for providing the report , a quick question are there any pending reconstruction requests that are timed out? > approx scheduled blocks not reseting over time > -- > > Key: HDFS-15420 > URL: https://issues.apache.org/jira/browse/HDFS-15420 > Project: Hadoop HDFS > Issue Type: Bug > Components: block placement >Affects Versions: 2.6.0, 3.0.0 > Environment: Our 2.6.0 environment is a 3 node cluster running > cdh5.15.0. > Our 3.0.0 environment is a 4 node cluster running cdh6.3.0. >Reporter: Max Mizikar >Priority: Minor > Attachments: Screenshot from 2020-06-18 09-29-57.png, Screenshot from > 2020-06-18 09-31-15.png > > > We have been experiencing large amounts of scheduled blocks that never get > cleared out. This is preventing blocks from being placed even when there is > plenty of space on the system. > Here is an example of the block growth over 24 hours on one of our systems > running 2.6.0 > !Screenshot from 2020-06-18 09-29-57.png! > Here is an example of the block growth over 24 hours on one of our systems > running 3.0.0 > !Screenshot from 2020-06-18 09-31-15.png! > https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue > we were having on 2.6.0 so the growth has decreased since upgrading to 3.0.0, > however, there appears to still be a systemic growth in scheduled blocks over > time and our systems will still need to restart the namenode on occasion to > reset this count. I have not determined what is causing the leaked blocks in > 3.0.0. > Looking into the issue, I discovered that the intention is for scheduled > blocks to slowly go back down to 0 after errors cause blocks to be leaked. > {code} > /** Increment the number of blocks scheduled. 
*/ > void incrementBlocksScheduled(StorageType t) { > currApproxBlocksScheduled.add(t, 1); > } > > /** Decrement the number of blocks scheduled. */ > void decrementBlocksScheduled(StorageType t) { > if (prevApproxBlocksScheduled.get(t) > 0) { > prevApproxBlocksScheduled.subtract(t, 1); > } else if (currApproxBlocksScheduled.get(t) > 0) { > currApproxBlocksScheduled.subtract(t, 1); > } > // its ok if both counters are zero. > } > > /** Adjusts curr and prev number of blocks scheduled every few minutes. */ > private void rollBlocksScheduled(long now) { > if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) { > prevApproxBlocksScheduled.set(currApproxBlocksScheduled); > currApproxBlocksScheduled.reset(); > lastBlocksScheduledRollTime = now; > } > } > {code} > However, this code does not do what is intended if the system has a constant > flow of written blocks. If blocks make it into prevApproxBlocksScheduled, the > next scheduled block increments currApproxBlocksScheduled and when it > completes, it decrements prevApproxBlocksScheduled preventing the leaked > block to be removed from the approx count. So, for errors to be corrected, we > have to not write any data for the roll period of 10 minutes. The number of > blocks we write per 10 minutes is quite high. This allows the error on the > approx counts to grow to very large numbers. > The comments in the ticket for the original implementation suggest this > issues was known. https://issues.apache.org/jira/browse/HADOOP-3707. However, > it's not clear to me if the severity of it was known at the time. > > So if there are some blocks that are not reported back by the datanode, > > they will eventually get adjusted (usually 10 min; bit longer if datanode > > is continuously receiving blocks). > The comments suggest it will eventually get cleared out, but in our case, it > never gets cleared out. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
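The curr/prev rolling in the quoted code can be modeled with a toy counter (a sketch of the arithmetic only, not the real HDFS classes) to show why a leaked block never drains while writes keep flowing: each completion drains prev, and each roll refills prev from curr, so the leaked unit circulates forever.

```java
public class ScheduledSketch {
    public long curr, prev;

    public void schedule() { curr++; }            // incrementBlocksScheduled
    public void complete() {                      // decrementBlocksScheduled
        if (prev > 0) prev--;
        else if (curr > 0) curr--;
    }
    public void roll()     { prev = curr; curr = 0; } // rollBlocksScheduled
    public long approx()   { return curr + prev; }

    public static void main(String[] args) {
        ScheduledSketch c = new ScheduledSketch();
        c.schedule();                 // one leaked block that never completes
        for (int i = 0; i < 100; i++) {
            c.roll();                 // a roll period passes...
            c.schedule();             // ...but a new block is scheduled
            c.complete();             // and its completion drains prev instead
        }
        System.out.println(c.approx()); // prints 1 -- the leak never clears
    }
}
```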
[jira] [Commented] (HDFS-15372) Files in snapshots no longer see attribute provider permissions
[ https://issues.apache.org/jira/browse/HDFS-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139441#comment-17139441 ] Hudson commented on HDFS-15372: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18363 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18363/]) Revert "HDFS-15372. Files in snapshots no longer see attribute provider (weichiu: rev edf716a5c3ed7f51c994ec8bcc460445f9bb8ece) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestINodeAttributeProvider.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSPermissionChecker.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodesInPath.java HDFS-15372. Files in snapshots no longer see attribute provider (weichiu: rev d50e93ce7b6aba235ecc0143fe2c7a0150a3ceae) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSPermissionChecker.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodesInPath.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestINodeAttributeProvider.java > Files in snapshots no longer see attribute provider permissions > --- > > Key: HDFS-15372 > URL: https://issues.apache.org/jira/browse/HDFS-15372 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Fix For: 3.3.1, 3.4.0 > > Attachments: HDFS-15372.001.patch, HDFS-15372.002.patch, > HDFS-15372.003.patch, HDFS-15372.004.patch, HDFS-15372.005.patch > > > Given a cluster with an authorization provider configured (eg 
Sentry) and the > paths covered by the provider are snapshotable, there was a change in > behaviour in how the provider permissions and ACLs are applied to files in > snapshots between the 2.x branch and Hadoop 3.0. > Eg, if we have the snapshotable path /data, which is Sentry managed. The ACLs > below are provided by Sentry: > {code} > hadoop fs -getfacl -R /data > # file: /data > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/tab1 > # owner: hive > # group: hive > user::rwx > group::--- > group:flume:rwx > user:hive:rwx > group:hive:rwx > group:testgroup:rwx > mask::rwx > other::--x > /data/tab1 > {code} > After taking a snapshot, the files in the snapshot do not see the provider > permissions: > {code} > hadoop fs -getfacl -R /data/.snapshot > # file: /data/.snapshot > # owner: > # group: > user::rwx > group::rwx > other::rwx > # file: /data/.snapshot/snap1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/.snapshot/snap1/tab1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > {code} > However pre-Hadoop 3.0 (when the attribute provider etc was extensively > refactored) snapshots did get the provider permissions. > The reason is this code in FSDirectory.java which ultimately calls the > attribute provider and passes the path we want permissions for: > {code} > INodeAttributes getAttributes(INodesInPath iip) > throws IOException { > INode node = FSDirectory.resolveLastINode(iip); > int snapshot = iip.getPathSnapshotId(); > INodeAttributes nodeAttrs = node.getSnapshotINode(snapshot); > UserGroupInformation ugi = NameNode.getRemoteUser(); > INodeAttributeProvider ap = this.getUserFilteredAttributeProvider(ugi); > if (ap != null) { > // permission checking sends the full components array including the > // first empty component for the root. however file status > // related calls are expected to strip out the root component according > // to TestINodeAttributeProvider. 
> byte[][] components = iip.getPathComponents(); > components = Arrays.copyOfRange(components, 1, components.length); > nodeAttrs = ap.getAttributes(components, nodeAttrs); > } > return nodeAttrs; > } > {code} > The line: > {code} > INode node = FSDirectory.resolveLastINode(iip); > {code} > Picks the last resolved Inode and if you then call node.getPathComponents, > for a path like '/data/.snapshot/snap1/tab1' it will return /data/tab1. It > resolves the snapshot path to its original location, but its still the > snapshot inode. > However the logic passes 'iip.getPathComponents' which returns > "/user/.snapshot/snap1/tab" to the provider. > The pre Hadoop 3.0 code passes the i
[jira] [Commented] (HDFS-15420) approx scheduled blocks not reseting over time
[ https://issues.apache.org/jira/browse/HDFS-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139440#comment-17139440 ] Max Mizikar commented on HDFS-15420: We are forking and switching the order in which currApprox and prevApprox get decremented. We will be decrementing from currApprox first. I don't think this is a good solution for everyone. It's much more likely to undercount in a functioning system than the current implementation. We are running deployments where all nodes have the same disk size and have alerts long before we fill up disk and need to worry about scheduled too much. We have also considered making currApprox and prevApprox a map from block to count. We have run this as a test for a bit and it seemed to work somewhat well. It's certainly more cpu and memory and requires more synchronization, but we have not had it be an issue. > approx scheduled blocks not reseting over time > -- > > Key: HDFS-15420 > URL: https://issues.apache.org/jira/browse/HDFS-15420 > Project: Hadoop HDFS > Issue Type: Bug > Components: block placement >Affects Versions: 2.6.0, 3.0.0 > Environment: Our 2.6.0 environment is a 3 node cluster running > cdh5.15.0. > Our 3.0.0 environment is a 4 node cluster running cdh6.3.0. >Reporter: Max Mizikar >Priority: Minor > Attachments: Screenshot from 2020-06-18 09-29-57.png, Screenshot from > 2020-06-18 09-31-15.png > > > We have been experiencing large amounts of scheduled blocks that never get > cleared out. This is preventing blocks from being placed even when there is > plenty of space on the system. > Here is an example of the block growth over 24 hours on one of our systems > running 2.6.0 > !Screenshot from 2020-06-18 09-29-57.png! > Here is an example of the block growth over 24 hours on one of our systems > running 3.0.0 > !Screenshot from 2020-06-18 09-31-15.png! 
> https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue > we were having on 2.6.0 so the growth has decreased since upgrading to 3.0.0, > however, there appears to still be a systemic growth in scheduled blocks over > time and our systems will still need to restart the namenode on occasion to > reset this count. I have not determined what is causing the leaked blocks in > 3.0.0. > Looking into the issue, I discovered that the intention is for scheduled > blocks to slowly go back down to 0 after errors cause blocks to be leaked. > {code} > /** Increment the number of blocks scheduled. */ > void incrementBlocksScheduled(StorageType t) { > currApproxBlocksScheduled.add(t, 1); > } > > /** Decrement the number of blocks scheduled. */ > void decrementBlocksScheduled(StorageType t) { > if (prevApproxBlocksScheduled.get(t) > 0) { > prevApproxBlocksScheduled.subtract(t, 1); > } else if (currApproxBlocksScheduled.get(t) > 0) { > currApproxBlocksScheduled.subtract(t, 1); > } > // its ok if both counters are zero. > } > > /** Adjusts curr and prev number of blocks scheduled every few minutes. */ > private void rollBlocksScheduled(long now) { > if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) { > prevApproxBlocksScheduled.set(currApproxBlocksScheduled); > currApproxBlocksScheduled.reset(); > lastBlocksScheduledRollTime = now; > } > } > {code} > However, this code does not do what is intended if the system has a constant > flow of written blocks. If blocks make it into prevApproxBlocksScheduled, the > next scheduled block increments currApproxBlocksScheduled and when it > completes, it decrements prevApproxBlocksScheduled preventing the leaked > block to be removed from the approx count. So, for errors to be corrected, we > have to not write any data for the roll period of 10 minutes. The number of > blocks we write per 10 minutes is quite high. This allows the error on the > approx counts to grow to very large numbers. 
> The comments in the ticket for the original implementation suggest this > issues was known. https://issues.apache.org/jira/browse/HADOOP-3707. However, > it's not clear to me if the severity of it was known at the time. > > So if there are some blocks that are not reported back by the datanode, > > they will eventually get adjusted (usually 10 min; bit longer if datanode > > is continuously receiving blocks). > The comments suggest it will eventually get cleared out, but in our case, it > never gets cleared out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
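The fork described above, decrementing currApprox before prevApprox, can be modeled the same way. Under the same constant-write workload the leaked unit now sits in prev and is discarded at the next roll, at the cost of the undercounting the commenter acknowledges (again a sketch of the arithmetic, not the real classes):

```java
public class ScheduledSketchCurrFirst {
    public long curr, prev;

    public void schedule() { curr++; }
    public void complete() {                 // curr first, as in the fork
        if (curr > 0) curr--;
        else if (prev > 0) prev--;
    }
    public void roll()     { prev = curr; curr = 0; }
    public long approx()   { return curr + prev; }

    public static void main(String[] args) {
        ScheduledSketchCurrFirst c = new ScheduledSketchCurrFirst();
        c.schedule();                        // leaked block
        for (int i = 0; i < 3; i++) {
            c.roll();                        // leak moves to prev, then ages out
            c.schedule();
            c.complete();                    // drains curr, not the leak
        }
        System.out.println(c.approx());      // prints 0 -- the leak rolled away
    }
}
```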
[jira] [Created] (HDFS-15420) approx scheduled blocks not reseting over time
Max Mizikar created HDFS-15420: -- Summary: approx scheduled blocks not reseting over time Key: HDFS-15420 URL: https://issues.apache.org/jira/browse/HDFS-15420 Project: Hadoop HDFS Issue Type: Bug Components: block placement Affects Versions: 3.0.0, 2.6.0 Environment: Our 2.6.0 environment is a 3 node cluster running cdh5.15.0. Our 3.0.0 environment is a 4 node cluster running cdh6.3.0. Reporter: Max Mizikar Attachments: Screenshot from 2020-06-18 09-29-57.png, Screenshot from 2020-06-18 09-31-15.png We have been experiencing large amounts of scheduled blocks that never get cleared out. This is preventing blocks from being placed even when there is plenty of space on the system. Here is an example of the block growth over 24 hours on one of our systems running 2.6.0 !Screenshot from 2020-06-18 09-29-57.png! Here is an example of the block growth over 24 hours on one of our systems running 3.0.0 !Screenshot from 2020-06-18 09-31-15.png! https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue we were having on 2.6.0 so the growth has decreased since upgrading to 3.0.0, however, there appears to still be a systemic growth in scheduled blocks over time and our systems will still need to restart the namenode on occasion to reset this count. I have not determined what is causing the leaked blocks in 3.0.0. Looking into the issue, I discovered that the intention is for scheduled blocks to slowly go back down to 0 after errors cause blocks to be leaked. {code} /** Increment the number of blocks scheduled. */ void incrementBlocksScheduled(StorageType t) { currApproxBlocksScheduled.add(t, 1); } /** Decrement the number of blocks scheduled. */ void decrementBlocksScheduled(StorageType t) { if (prevApproxBlocksScheduled.get(t) > 0) { prevApproxBlocksScheduled.subtract(t, 1); } else if (currApproxBlocksScheduled.get(t) > 0) { currApproxBlocksScheduled.subtract(t, 1); } // its ok if both counters are zero. 
} /** Adjusts curr and prev number of blocks scheduled every few minutes. */ private void rollBlocksScheduled(long now) { if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) { prevApproxBlocksScheduled.set(currApproxBlocksScheduled); currApproxBlocksScheduled.reset(); lastBlocksScheduledRollTime = now; } } {code} However, this code does not do what is intended if the system has a constant flow of written blocks. If blocks make it into prevApproxBlocksScheduled, the next scheduled block increments currApproxBlocksScheduled and when it completes, it decrements prevApproxBlocksScheduled preventing the leaked block to be removed from the approx count. So, for errors to be corrected, we have to not write any data for the roll period of 10 minutes. The number of blocks we write per 10 minutes is quite high. This allows the error on the approx counts to grow to very large numbers. The comments in the ticket for the original implementation suggest this issues was known. https://issues.apache.org/jira/browse/HADOOP-3707. However, it's not clear to me if the severity of it was known at the time. > So if there are some blocks that are not reported back by the datanode, they > will eventually get adjusted (usually 10 min; bit longer if datanode is > continuously receiving blocks). The comments suggest it will eventually get cleared out, but in our case, it never gets cleared out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15372) Files in snapshots no longer see attribute provider permissions
[ https://issues.apache.org/jira/browse/HDFS-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139432#comment-17139432 ] Wei-Chiu Chuang commented on HDFS-15372: Accidentally committed version 004 instead of the last, 005 version. This is now corrected. > Files in snapshots no longer see attribute provider permissions > --- > > Key: HDFS-15372 > URL: https://issues.apache.org/jira/browse/HDFS-15372 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Fix For: 3.3.1, 3.4.0 > > Attachments: HDFS-15372.001.patch, HDFS-15372.002.patch, > HDFS-15372.003.patch, HDFS-15372.004.patch, HDFS-15372.005.patch > > > Given a cluster with an authorization provider configured (eg Sentry) and the > paths covered by the provider are snapshotable, there was a change in > behaviour in how the provider permissions and ACLs are applied to files in > snapshots between the 2.x branch and Hadoop 3.0. > Eg, if we have the snapshotable path /data, which is Sentry managed. 
The ACLs > below are provided by Sentry: > {code} > hadoop fs -getfacl -R /data > # file: /data > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/tab1 > # owner: hive > # group: hive > user::rwx > group::--- > group:flume:rwx > user:hive:rwx > group:hive:rwx > group:testgroup:rwx > mask::rwx > other::--x > /data/tab1 > {code} > After taking a snapshot, the files in the snapshot do not see the provider > permissions: > {code} > hadoop fs -getfacl -R /data/.snapshot > # file: /data/.snapshot > # owner: > # group: > user::rwx > group::rwx > other::rwx > # file: /data/.snapshot/snap1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/.snapshot/snap1/tab1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > {code} > However pre-Hadoop 3.0 (when the attribute provider etc was extensively > refactored) snapshots did get the provider permissions. > The reason is this code in FSDirectory.java which ultimately calls the > attribute provider and passes the path we want permissions for: > {code} > INodeAttributes getAttributes(INodesInPath iip) > throws IOException { > INode node = FSDirectory.resolveLastINode(iip); > int snapshot = iip.getPathSnapshotId(); > INodeAttributes nodeAttrs = node.getSnapshotINode(snapshot); > UserGroupInformation ugi = NameNode.getRemoteUser(); > INodeAttributeProvider ap = this.getUserFilteredAttributeProvider(ugi); > if (ap != null) { > // permission checking sends the full components array including the > // first empty component for the root. however file status > // related calls are expected to strip out the root component according > // to TestINodeAttributeProvider. 
> byte[][] components = iip.getPathComponents(); > components = Arrays.copyOfRange(components, 1, components.length); > nodeAttrs = ap.getAttributes(components, nodeAttrs); > } > return nodeAttrs; > } > {code} > The line: > {code} > INode node = FSDirectory.resolveLastINode(iip); > {code} > Picks the last resolved Inode, and if you then call node.getPathComponents > for a path like '/data/.snapshot/snap1/tab1' it will return /data/tab1. It > resolves the snapshot path to its original location, but it's still the > snapshot inode. > However, the logic passes 'iip.getPathComponents', which returns > "/user/.snapshot/snap1/tab" to the provider. > The pre-Hadoop 3.0 code passes the inode directly to the provider, and hence > it only ever sees the path as "/user/data/tab1". > It is debatable which path should be passed to the provider - > /user/.snapshot/snap1/tab or /data/tab1 in the case of snapshots. However, as > the behaviour has changed, I feel we should ensure the old behaviour is > retained. > It would also be fairly easy to provide a config switch so the provider gets > the full snapshot path or the resolved path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
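The components handling quoted above can be illustrated with a small sketch. The split/strip helpers below are hypothetical stand-ins for INodesInPath.getPathComponents and the Arrays.copyOfRange call: the resolver keeps an empty leading component for the root, and copyOfRange strips it before the attribute provider is invoked, so the provider sees the .snapshot form of the path rather than the resolved one.

```java
import java.util.Arrays;

// Hypothetical illustration of the components array discussed above; this is
// not the NameNode code, just the same array manipulation in isolation.
class PathComponents {
    // Split an absolute path into components, keeping the empty root
    // component, e.g. "/a/b" -> ["", "a", "b"].
    static String[] components(String path) {
        return path.split("/", -1);
    }

    // Mirror of Arrays.copyOfRange(components, 1, components.length):
    // drop the empty root component before handing off to the provider.
    static String[] stripRoot(String[] comps) {
        return Arrays.copyOfRange(comps, 1, comps.length);
    }
}
```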
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139379#comment-17139379 ] Stephen O'Donnell commented on HDFS-15406: -- Committed this to all active 3.x branches with no conflicts. Thanks for the contribution [~hemanthboyina]! > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5 > > Attachments: HDFS-15406.001.patch, HDFS-15406.002.patch > > > In our customer cluster we have approx 10M blocks in one datanode > the Datanode to scans all the blocks , it has taken nearly 5mins > {code:java} > 2020-06-10 12:17:06,869 | INFO | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: > 11149530, missing metadata files:472, missing block files:472, missing blocks > in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 > 2020-06-10 12:17:06,869 | WARN | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | Lock held time above threshold: lock identifier: > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl > lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. 
The stack trace is: > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) > org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) > org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) > org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) > org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > | InstrumentedLock.java:143 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
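The "Lock held time above threshold" warning in the log above comes from a lock wrapper that times each hold and logs when it exceeds a threshold. A simplified sketch of that mechanism follows; names are illustrative, not the actual InstrumentedLock class, and a clock supplier is injected so the behaviour can be shown deterministically.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;

// Simplified, hypothetical sketch of an instrumented lock: measure how long
// the lock is held and count holds that exceed a threshold. The real class
// also logs the holder's stack trace, as seen in the log above.
class TimedLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final long thresholdMs;
    private final LongSupplier clockMs;  // injectable clock for testing
    private long acquiredAtMs;
    long warnings = 0;

    TimedLock(long thresholdMs, LongSupplier clockMs) {
        this.thresholdMs = thresholdMs;
        this.clockMs = clockMs;
    }

    void lock() {
        lock.lock();
        acquiredAtMs = clockMs.getAsLong();
    }

    void unlock() {
        long heldMs = clockMs.getAsLong() - acquiredAtMs;
        if (heldMs > thresholdMs) {
            warnings++;  // the real implementation emits the WARN log here
        }
        lock.unlock();
    }
}
```

In the log, the scan held the dataset lock for 329854 ms, far above the threshold, which is exactly the condition this wrapper detects on unlock.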
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-15406: - Resolution: Fixed Status: Resolved (was: Patch Available) > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5 > > Attachments: HDFS-15406.001.patch, HDFS-15406.002.patch > > > In our customer cluster we have approx 10M blocks in one datanode > the Datanode to scans all the blocks , it has taken nearly 5mins > {code:java} > 2020-06-10 12:17:06,869 | INFO | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: > 11149530, missing metadata files:472, missing block files:472, missing blocks > in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 > 2020-06-10 12:17:06,869 | WARN | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | Lock held time above threshold: lock identifier: > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl > lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. 
The stack trace is: > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) > org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) > org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) > org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) > org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > | InstrumentedLock.java:143 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-15406: - Fix Version/s: 3.1.5 3.4.0 3.3.1 3.2.2 > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5 > > Attachments: HDFS-15406.001.patch, HDFS-15406.002.patch > > > In our customer cluster we have approx 10M blocks in one datanode > the Datanode to scans all the blocks , it has taken nearly 5mins > {code:java} > 2020-06-10 12:17:06,869 | INFO | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: > 11149530, missing metadata files:472, missing block files:472, missing blocks > in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 > 2020-06-10 12:17:06,869 | WARN | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | Lock held time above threshold: lock identifier: > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl > lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. 
The stack trace is: > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) > org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) > org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) > org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) > org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > | InstrumentedLock.java:143 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15419) Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139351#comment-17139351 ] Yuxuan Wang commented on HDFS-15419: [~ayushtkn] But currently the router will retry or failover once in the code. I don't know which patch introduced that. Should we file a jira to remove the logic? > Router should retry communicate with NN when cluster is unavailable using > configurable time interval > > > Key: HDFS-15419 > URL: https://issues.apache.org/jira/browse/HDFS-15419 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration, hdfs-client, rbf >Reporter: bhji123 >Priority: Major > > When cluster is unavailable, router -> namenode communication will only retry > once without any time interval, that is not reasonable. > For example, in my company, which has several hdfs clusters with more than > 1000 nodes, we have encountered this problem. In some cases, the cluster > becomes unavailable briefly for about 10 or 30 seconds, at the same time, > almost all rpc requests to router failed because router only retry once > without time interval. > It's better for us to enhance the router retry strategy, to retry > **communicate with NN using configurable time interval and max retry times. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139346#comment-17139346 ] Hudson commented on HDFS-15406: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18362 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18362/]) HDFS-15406. Improve the speed of Datanode Block Scan. Contributed by (sodonnell: rev 123777823edc98553fcef61f1913ab6e4cd5aa9a) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/DataNodeTestUtils.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15406.001.patch, HDFS-15406.002.patch > > > In our customer cluster we have approx 10M blocks in one datanode > the Datanode to scans all the blocks , it has taken nearly 5mins > {code:java} > 2020-06-10 12:17:06,869 | INFO | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: > 11149530, missing metadata files:472, missing block files:472, missing blocks > in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 > 2020-06-10 12:17:06,869 | WARN | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | Lock held time above threshold: lock identifier: > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl > lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. 
The stack trace is: > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) > org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) > org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) > org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) > org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > | InstrumentedLock.java:143 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15417) Lazy get the datanode report for federation WebHDFS operations
[ https://issues.apache.org/jira/browse/HDFS-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139299#comment-17139299 ] Ayush Saxena commented on HDFS-15417: - It indeed makes sense to minimize the getDatanodeReport() calls in whatever ways possible. I couldn't check the code, but it would be good if we can restrict it. Usually in a big cluster, I too have observed that getDatanodeReport() is quite heavy. If it still bothers us much, we can think about caching and similar approaches as well. > Lazy get the datanode report for federation WebHDFS operations > -- > > Key: HDFS-15417 > URL: https://issues.apache.org/jira/browse/HDFS-15417 > Project: Hadoop HDFS > Issue Type: Improvement > Components: federation, rbf, webhdfs >Reporter: Ye Ni >Assignee: Ye Ni >Priority: Minor > > *Why* > For WebHDFS CREATE, OPEN, APPEND and GETFILECHECKSUM operations, router or > namenode needs to get the datanodes where the block is located, then redirect > the request to one of the datanodes. > However, this chooseDatanode action in router is much slower than namenode, > which directly affects the WebHDFS operations above. > For namenode WebHDFS, it normally takes tens of milliseconds, while router > always takes more than 2 seconds. > *How* > Only get the datanode report when necessary in router. It is a very expensive > operation where all the time is spent. > This is only needed when we want to exclude some datanodes or find a random > datanode for CREATE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
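The "lazy get" idea described in the issue can be sketched as a memoizing supplier around the expensive call. This is a hypothetical illustration, not the actual router code: the report is fetched at most once, and not at all if no code path ends up needing datanode locations.

```java
import java.util.function.Supplier;

// Hypothetical sketch of lazily fetching an expensive value such as the
// datanode report: the loader runs only on first use, and never runs if no
// caller asks for the value.
class Lazy<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded = false;

    Lazy(Supplier<T> loader) { this.loader = loader; }

    @Override
    public synchronized T get() {
        if (!loaded) {
            value = loader.get();  // the expensive call, e.g. getDatanodeReport()
            loaded = true;
        }
        return value;
    }
}
```

A caching layer with a TTL, as the comment suggests, would be a natural extension of the same wrapper.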
[jira] [Commented] (HDFS-15419) Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139292#comment-17139292 ] Ayush Saxena commented on HDFS-15419: - I think this has been discussed somewhere before as well. The Router is just a proxy: it needs to take the call from the client, proxy it to the nameservice, and give whatever response it gets from the nameservice back to the actual client. It is up to the actual client's discretion whether it wants to wait/retry or not. Holding and retrying a call at the Router doesn't seem apt to me. The retry logic is already there in the client-side code; this may lead to double retries too, and it would be better if the client alone decides whether it needs to try again or not. > Router should retry communicate with NN when cluster is unavailable using > configurable time interval > > > Key: HDFS-15419 > URL: https://issues.apache.org/jira/browse/HDFS-15419 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration, hdfs-client, rbf >Reporter: bhji123 >Priority: Major > > When cluster is unavailable, router -> namenode communication will only retry > once without any time interval, that is not reasonable. > For example, in my company, which has several hdfs clusters with more than > 1000 nodes, we have encountered this problem. In some cases, the cluster > becomes unavailable briefly for about 10 or 30 seconds, at the same time, > almost all rpc requests to router failed because router only retry once > without time interval. > It's better for us to enhance the router retry strategy, to retry > **communicate with NN using configurable time interval and max retry times. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15419) Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139275#comment-17139275 ] Yuxuan Wang commented on HDFS-15419: Hi~[~bhji123] If the router retries more times and for longer, while the clients' timeout and retry settings are still in place, how does that work? If the router retries but the NN is still unavailable, the clients will time out and then retry anyway. In that case, what is the difference between letting the router retry and letting the clients retry? > Router should retry communicate with NN when cluster is unavailable using > configurable time interval > > > Key: HDFS-15419 > URL: https://issues.apache.org/jira/browse/HDFS-15419 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration, hdfs-client, rbf >Reporter: bhji123 >Priority: Major > > When cluster is unavailable, router -> namenode communication will only retry > once without any time interval, that is not reasonable. > For example, in my company, which has several hdfs clusters with more than > 1000 nodes, we have encountered this problem. In some cases, the cluster > becomes unavailable briefly for about 10 or 30 seconds, at the same time, > almost all rpc requests to router failed because router only retry once > without time interval. > It's better for us to enhance the router retry strategy, to retry > **communicate with NN using configurable time interval and max retry times. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
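The proposal in the issue description — retry with a configurable interval and maximum retry count instead of a single immediate retry — can be sketched as follows. The interface and class names here are hypothetical, not RBF code; they only show the shape of the suggested policy that the commenters are debating.

```java
// Hypothetical sketch of the proposed retry strategy: up to maxRetries
// retries with a fixed sleep between attempts, instead of one immediate retry.
interface RemoteCall<T> {
    T run() throws Exception;
}

class ConfigurableRetry {
    static <T> T invoke(RemoteCall<T> call, int maxRetries, long intervalMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.run();
            } catch (Exception e) {
                last = e;                      // remember the latest failure
                if (attempt < maxRetries) {
                    Thread.sleep(intervalMs);  // wait before the next attempt
                }
            }
        }
        throw last;                            // all attempts exhausted
    }
}
```

The double-retry concern raised above applies directly to this sketch: if the client wraps its call in a similar loop, the waits multiply and the client may time out while the router is still retrying.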
[jira] [Assigned] (HDFS-15419) Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuxuan Wang reassigned HDFS-15419: -- Assignee: (was: Yuxuan Wang) > Router should retry communicate with NN when cluster is unavailable using > configurable time interval > > > Key: HDFS-15419 > URL: https://issues.apache.org/jira/browse/HDFS-15419 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration, hdfs-client, rbf >Reporter: bhji123 >Priority: Major > > When cluster is unavailable, router -> namenode communication will only retry > once without any time interval, that is not reasonable. > For example, in my company, which has several hdfs clusters with more than > 1000 nodes, we have encountered this problem. In some cases, the cluster > becomes unavailable briefly for about 10 or 30 seconds, at the same time, > almost all rpc requests to router failed because router only retry once > without time interval. > It's better for us to enhance the router retry strategy, to retry > **communicate with NN using configurable time interval and max retry times. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15419) Router should retry communicate with NN when cluster is unavailable using configurable time interval
[ https://issues.apache.org/jira/browse/HDFS-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuxuan Wang reassigned HDFS-15419: -- Assignee: Yuxuan Wang > Router should retry communicate with NN when cluster is unavailable using > configurable time interval > > > Key: HDFS-15419 > URL: https://issues.apache.org/jira/browse/HDFS-15419 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration, hdfs-client, rbf >Reporter: bhji123 >Assignee: Yuxuan Wang >Priority: Major > > When cluster is unavailable, router -> namenode communication will only retry > once without any time interval, that is not reasonable. > For example, in my company, which has several hdfs clusters with more than > 1000 nodes, we have encountered this problem. In some cases, the cluster > becomes unavailable briefly for about 10 or 30 seconds, at the same time, > almost all rpc requests to router failed because router only retry once > without time interval. > It's better for us to enhance the router retry strategy, to retry > **communicate with NN using configurable time interval and max retry times. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139216#comment-17139216 ] Hadoop QA commented on HDFS-15410: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 24s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 36s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 16s{color} | {color:green} hadoop-federation-balance in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 62m 7s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-HDFS-Build/29438/artifact/out/Dockerfile | | JIRA Issue | HDFS-15410 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13005930/HDFS-15410.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux 22c1f191b89c 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 9cbd76cc775 | | Default Java | Private Build-1.8.0_252-8u252-b09-1~
[jira] [Updated] (HDFS-15410) Add separated config file fedbalance-default.xml for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15410: --- Attachment: HDFS-15410.001.patch Status: Patch Available (was: Open) > Add separated config file fedbalance-default.xml for fedbalance tool > > > Key: HDFS-15410 > URL: https://issues.apache.org/jira/browse/HDFS-15410 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15410.001.patch > > > Add a separated config file named fedbalance-default.xml for fedbalance tool > configs. It's like the distcp-default.xml for the distcp tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15374) Add documentation for fedbalance tool
[ https://issues.apache.org/jira/browse/HDFS-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139148#comment-17139148 ] Jinglun commented on HDFS-15374: Hi [~linyiqun], thanks for your comments! Uploaded v02 and the image BalanceProcedureScheduler.png. > Add documentation for fedbalance tool > - > > Key: HDFS-15374 > URL: https://issues.apache.org/jira/browse/HDFS-15374 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: BalanceProcedureScheduler.png, HDFS-15374.001.patch, > HDFS-15374.002.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org