[jira] [Work logged] (HDFS-15974) RBF: Unable to display the datanode UI of the router
[ https://issues.apache.org/jira/browse/HDFS-15974?focusedWorklogId=585571&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585571 ]

ASF GitHub Bot logged work on HDFS-15974:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 20/Apr/21 05:53
Start Date: 20/Apr/21 05:53
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #2915:
URL: https://github.com/apache/hadoop/pull/2915#issuecomment-822993787

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 36m 8s | | trunk passed |
| +1 :green_heart: | compile | 0m 41s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | compile | 0m 36s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 28s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 42s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 40s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 0m 56s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 15s | | trunk passed |
| +1 :green_heart: | shadedclient | 14m 20s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 32s | | the patch passed |
| +1 :green_heart: | compile | 0m 32s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javac | 0m 32s | | the patch passed |
| +1 :green_heart: | compile | 0m 28s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | javac | 0m 28s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 17s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 31s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 31s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 0m 48s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 15s | | the patch passed |
| +1 :green_heart: | shadedclient | 14m 14s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 17m 46s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2915/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
| | | | 95m 34s | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.federation.router.TestRouterRpcMultiDestination |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2915/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/2915 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
| uname | Linux 8a84274adc84 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 1ac78a83e20f267b8c5da2831823c594f01e8951 |
| Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2915/2/testReport/ |
| Max. process+thread count | 2352 (vs. ulimit of 5500) |
| modules
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=585554&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585554 ]

ASF GitHub Bot logged work on HDFS-15989:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 20/Apr/21 05:00
Start Date: 20/Apr/21 05:00
Worklog Time Spent: 10m

Work Description: virajjasani commented on a change in pull request #2923:
URL: https://github.com/apache/hadoop/pull/2923#discussion_r616344648

## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer2.java
## @@ -0,0 +1,746 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.balancer;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.StorageType;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.hdfs.DFSClient;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.hadoop.hdfs.DFSTestUtil;
+import org.apache.hadoop.hdfs.DFSUtil;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.HdfsConfiguration;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hdfs.NameNodeProxies;
+import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys;
+import org.apache.hadoop.hdfs.protocol.ClientProtocol;
+import org.apache.hadoop.hdfs.protocol.DatanodeID;
+import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
+import org.apache.hadoop.hdfs.protocol.HdfsConstants;
+import org.apache.hadoop.hdfs.protocol.LocatedBlock;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementStatus;
+import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager;
+import org.apache.hadoop.hdfs.server.datanode.DataNode;
+import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils;
+import org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset;
+import org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.test.GenericTestUtils;
+import org.junit.After;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.slf4j.event.Level;
+
+import java.io.OutputStream;
+import java.net.InetSocketAddress;
+import java.net.URI;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hadoop.fs.StorageType.DEFAULT;
+import static org.apache.hadoop.fs.StorageType.RAM_DISK;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BLOCK_SIZE_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_BLOCK_PINNING_ENABLED;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_LAZY_WRITER_INTERVAL_SEC;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_MAX_LOCKED_MEMORY_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_LAZY_PERSIST_FILE_SCRUB_INTERVAL_SEC;
+import static org.apache.hadoop.test.PlatformAssumptions.assumeNotWindows;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * Some long running Balancer tasks.
+ */
+public class TestBalancer2 {

Review comment: Thanks @aajisaka. Done.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
[jira] [Commented] (HDFS-15963) Unreleased volume references cause an infinite loop
[ https://issues.apache.org/jira/browse/HDFS-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325472#comment-17325472 ]

Xiaoqiao He commented on HDFS-15963:
------------------------------------

[~zhangshuyan] would you like to offer another patch for branch-3.3? I just checked, and it does not cherry-pick cleanly back to branch-3.3.

> Unreleased volume references cause an infinite loop
> ---------------------------------------------------
>
> Key: HDFS-15963
> URL: https://issues.apache.org/jira/browse/HDFS-15963
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Shuyan Zhang
> Assignee: Shuyan Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15963.001.patch, HDFS-15963.002.patch, HDFS-15963.003.patch
>
> Time Spent: 4h
> Remaining Estimate: 0h
>
> When BlockSender throws an exception because the meta-data cannot be found, the volume reference obtained by the thread is not released, which causes the thread trying to remove the volume to wait and fall into an infinite loop.
> {code:java}
> boolean checkVolumesRemoved() {
>   Iterator<FsVolumeImpl> it = volumesBeingRemoved.iterator();
>   while (it.hasNext()) {
>     FsVolumeImpl volume = it.next();
>     if (!volume.checkClosed()) {
>       return false;
>     }
>     it.remove();
>   }
>   return true;
> }
>
> boolean checkClosed() {
>   // always be true.
>   if (this.reference.getReferenceCount() > 0) {
>     FsDatasetImpl.LOG.debug("The reference count for {} is {}, wait to be 0.",
>         this, reference.getReferenceCount());
>     return false;
>   }
>   return true;
> }
> {code}
> At the same time, because the thread has been holding checkDirsLock while removing the volume, other threads trying to acquire the same lock are permanently blocked.
> Similar problems also occur in RamDiskAsyncLazyPersistService and FsDatasetAsyncDiskService.
> This patch releases the three previously unreleased volume references.
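The leak pattern described in the issue can be sketched in a few lines. This is a simplified, hypothetical stand-in for HDFS's reference-counted volumes (the real classes are FsVolumeImpl/FsVolumeReference, not the names below): the essential fix is releasing the reference in a finally block so the exception path cannot leave the count pinned above zero.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted volume. The removal thread
// polls checkClosed() and spins forever while the count stays above zero.
class Volume {
    final AtomicInteger refCount = new AtomicInteger();

    AutoCloseable obtainReference() {
        refCount.incrementAndGet();
        return refCount::decrementAndGet;   // close() releases the reference
    }

    // Mirrors checkClosed(): removal can only proceed once the count is 0.
    boolean checkClosed() {
        return refCount.get() == 0;
    }
}

public class VolumeRefDemo {
    // Sketch of a BlockSender-like operation that fails mid-way, e.g. because
    // the block's meta-data file is missing.
    static void sendBlock(Volume v) throws Exception {
        AutoCloseable ref = v.obtainReference();
        try {
            throw new FileNotFoundException("meta-data not found");
        } finally {
            ref.close();  // without this, checkClosed() stays false forever
        }
    }

    public static void main(String[] args) throws Exception {
        Volume v = new Volume();
        try {
            sendBlock(v);
        } catch (FileNotFoundException expected) {
            // the send failed, but the reference was still released
        }
        System.out.println("closed=" + v.checkClosed());  // prints "closed=true"
    }
}
```

With the release on the exception path in place, the removal thread's wait loop terminates; without it, the thread holding checkDirsLock spins exactly as the issue describes.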
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=585551&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585551 ]

ASF GitHub Bot logged work on HDFS-15989:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 20/Apr/21 04:29
Start Date: 20/Apr/21 04:29
Worklog Time Spent: 10m

Work Description: aajisaka commented on a change in pull request #2923:
URL: https://github.com/apache/hadoop/pull/2923#discussion_r616335313

## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer2.java
## @@ -0,0 +1,746 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.balancer;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.StorageType;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.hdfs.DFSClient;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.hadoop.hdfs.DFSTestUtil;
+import org.apache.hadoop.hdfs.DFSUtil;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.HdfsConfiguration;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hdfs.NameNodeProxies;
+import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys;
+import org.apache.hadoop.hdfs.protocol.ClientProtocol;
+import org.apache.hadoop.hdfs.protocol.DatanodeID;
+import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
+import org.apache.hadoop.hdfs.protocol.HdfsConstants;
+import org.apache.hadoop.hdfs.protocol.LocatedBlock;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementStatus;
+import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager;
+import org.apache.hadoop.hdfs.server.datanode.DataNode;
+import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils;
+import org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset;
+import org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.test.GenericTestUtils;
+import org.junit.After;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.slf4j.event.Level;
+
+import java.io.OutputStream;
+import java.net.InetSocketAddress;
+import java.net.URI;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hadoop.fs.StorageType.DEFAULT;
+import static org.apache.hadoop.fs.StorageType.RAM_DISK;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BLOCK_SIZE_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_BLOCK_PINNING_ENABLED;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_LAZY_WRITER_INTERVAL_SEC;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_MAX_LOCKED_MEMORY_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_LAZY_PERSIST_FILE_SCRUB_INTERVAL_SEC;
+import static org.apache.hadoop.test.PlatformAssumptions.assumeNotWindows;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * Some long running Balancer tasks.
+ */
+public class TestBalancer2 {

Review comment: Would you rename the class to describe the test cases, such as TestBalancerLongRunningTasks?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact
[jira] [Work logged] (HDFS-15974) RBF: Unable to display the datanode UI of the router
[ https://issues.apache.org/jira/browse/HDFS-15974?focusedWorklogId=585548&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585548 ]

ASF GitHub Bot logged work on HDFS-15974:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 20/Apr/21 04:19
Start Date: 20/Apr/21 04:19
Worklog Time Spent: 10m

Work Description: zhuxiangyi commented on a change in pull request #2915:
URL: https://github.com/apache/hadoop/pull/2915#discussion_r616332454

## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/NamenodeBeanMetrics.java
## @@ -855,7 +855,7 @@ public long getNumberOfSnapshottableDirs() {
   @Override
   public String getEnteringMaintenanceNodes() {
-    return "N/A";
+    return null;

Review comment: Thanks for your review @goiri. I think returning "{}" is more friendly than "null". I submitted new code and added tests.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 585548)
Time Spent: 0.5h  (was: 20m)

> RBF: Unable to display the datanode UI of the router
> ----------------------------------------------------
>
> Key: HDFS-15974
> URL: https://issues.apache.org/jira/browse/HDFS-15974
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf, ui
> Affects Versions: 3.4.0
> Reporter: zhu
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS-15358-1.patch, image-2021-04-15-11-36-47-644.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Clicking the Datanodes tag on the Router UI does not respond.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
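The reviewer's point above, that a bean value consumed by the router's JavaScript UI should be an empty JSON structure rather than null or "N/A", can be illustrated with a minimal sketch. The class and getter below only mimic the discussed contract; this is not the actual NamenodeBeanMetrics code.

```java
// Minimal illustration of the "{}" vs null discussion: a front end that does
// JSON.parse(bean.EnteringMaintenanceNodes) accepts "{}" as an empty object,
// whereas null or "N/A" makes the page fail instead of rendering empty.
public class MaintenanceNodesBeanDemo {
    // The router has no node-level data to report here, so it returns an
    // empty JSON map rather than null (hypothetical getter, mirroring the
    // method name under review).
    public String getEnteringMaintenanceNodes() {
        return "{}";
    }

    public static void main(String[] args) {
        String json = new MaintenanceNodesBeanDemo().getEnteringMaintenanceNodes();
        System.out.println(json);  // prints "{}"
    }
}
```

The design point is that the bean's string value is part of a de facto JSON contract with the UI, so "empty" should be expressed in JSON, not as a sentinel string.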
[jira] [Work logged] (HDFS-15975) Use LongAdder instead of AtomicLong
[ https://issues.apache.org/jira/browse/HDFS-15975?focusedWorklogId=585547&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585547 ]

ASF GitHub Bot logged work on HDFS-15975:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 20/Apr/21 04:18
Start Date: 20/Apr/21 04:18
Worklog Time Spent: 10m

Work Description: tomscut opened a new pull request #2936:
URL: https://github.com/apache/hadoop/pull/2936

JIRA: [HDFS-15975](https://issues.apache.org/jira/browse/HDFS-15975)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 585547)
Time Spent: 4h 10m  (was: 4h)

> Use LongAdder instead of AtomicLong
> -----------------------------------
>
> Key: HDFS-15975
> URL: https://issues.apache.org/jira/browse/HDFS-15975
> Project: Hadoop HDFS
> Issue Type: Wish
> Reporter: tomscut
> Assignee: tomscut
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> When counting some indicators, we can use LongAdder instead of AtomicLong to improve performance. The long value is not an atomic snapshot in LongAdder, but I think we can tolerate that.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
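The trade-off stated in the issue can be shown with a minimal, self-contained sketch: both counter types support concurrent increments, but `LongAdder` stripes updates across internal cells so contended increments are cheaper, at the cost of `sum()` not being an atomic snapshot while updates are in flight.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the AtomicLong -> LongAdder swap discussed in HDFS-15975.
// AtomicLong CASes a single shared word on every increment; LongAdder hits a
// striped cell, so highly concurrent metric updates contend far less.
// LongAdder.sum() is only weakly consistent during concurrent updates, which
// the issue notes is tolerable for metrics.
public class CounterDemo {
    static final AtomicLong atomic = new AtomicLong();
    static final LongAdder adder = new LongAdder();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                atomic.incrementAndGet(); // CAS on one shared word
                adder.increment();        // striped, low-contention update
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Once all writers have finished, both report the same exact total.
        System.out.println(atomic.get() + " " + adder.sum()); // prints "200000 200000"
    }
}
```

The quiescent totals always agree; it is only a `sum()` taken *while* writers are running that may miss in-flight increments, which is the "not an atomic snapshot" caveat from the issue description.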
[jira] [Commented] (HDFS-8023) Erasure Coding: retrieve erasure coding schema for a file from NameNode
[ https://issues.apache.org/jira/browse/HDFS-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325441#comment-17325441 ]

Stone commented on HDFS-8023:
-----------------------------

To use the EC feature, I will upgrade HDFS from 2.8 to 3.0+. If the active NameNode is upgraded first, the standby NameNode will read new edit log entries that include EC operations, and the standby will then fail because it cannot parse the EC edit log. How should the NameNodes be upgraded?

> Erasure Coding: retrieve erasure coding schema for a file from NameNode
> -----------------------------------------------------------------------
>
> Key: HDFS-8023
> URL: https://issues.apache.org/jira/browse/HDFS-8023
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Kai Zheng
> Assignee: Vinayakumar B
> Priority: Major
> Fix For: HDFS-7285
>
> Attachments: HDFS-8023-01.patch, HDFS-8023-02.patch
>
> NameNode needs to provide an RPC call for clients and tools to retrieve the erasure coding schema for a file from the NameNode.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15990) No need to write to sub_section when serialize SnapshotDiff Section
[ https://issues.apache.org/jira/browse/HDFS-15990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325409#comment-17325409 ]

Xuze Yang edited comment on HDFS-15990 at 4/20/21, 1:30 AM:
------------------------------------------------------------

[~sodonnell] Thank you for your timely reply. I got it. I just think it may cause a little confusion; would code comments be a better choice? Anyway, this is not a big problem.

I want to ask another question. HDFS-7784 and HDFS-14617 both aim to load the INodeSection and INodeDirectorySection in parallel, but their implementations differ. Taking the INodeSection as an example, HDFS-7784 divides the INodeSection into several INodeSections, while HDFS-14617 introduces INodeSection_Sub. In practice, HDFS-14617 may encounter downgrade problems (see HDFS-14771), but HDFS-7784 does not have this problem. So I want to ask why you chose the HDFS-14617 implementation; in other words, what are the advantages of HDFS-14617 compared to HDFS-7784? Looking forward to your answer, thanks!

> No need to write to sub_section when serialize SnapshotDiff Section
> -------------------------------------------------------------------
>
> Key: HDFS-15990
> URL: https://issues.apache.org/jira/browse/HDFS-15990
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.10.0
> Reporter: Xuze Yang
> Priority: Minor
>
> In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code exists:
> {code:java}
> if (i % parent.getInodesPerSubSection() == 0) {
>   parent.commitSubSection(headers,
>       FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB);
> }
> {code}
> It aims to serialize SnapshotDiff information into several sub_sections (i.e. additional sub_section information will be written to the FileSummary Section). But FSImageFormatPBSnapshot.loadSnapshotDiffSection() treats the SnapshotDiffSection as a whole, rather than as several sub_sections, so there is no need to introduce sub_sections here.
[jira] [Commented] (HDFS-15990) No need to write to sub_section when serialize SnapshotDiff Section
[ https://issues.apache.org/jira/browse/HDFS-15990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325409#comment-17325409 ]

Xuze Yang commented on HDFS-15990:
----------------------------------

[~sodonnell] Thank you for your timely reply. I got it. I just think it may cause a little confusion; would code comments be a better choice? Anyway, this is not a big problem.

I want to ask another question. HDFS-7784 and HDFS-14617 both aim to load the INodeSection and INodeDirectorySection in parallel, but their implementations differ. Taking the INodeSection as an example, HDFS-7784 divides the INodeSection into several INodeSections, while HDFS-14617 introduces INodeSection_Sub. In practice, HDFS-14617 may encounter downgrade problems (see HDFS-14771), but HDFS-7784 does not have this problem. So I want to ask why you chose the HDFS-14617 implementation; in other words, what are the advantages of HDFS-14617 compared to HDFS-7784? Looking forward to your answer, thanks!

> No need to write to sub_section when serialize SnapshotDiff Section
> -------------------------------------------------------------------
>
> Key: HDFS-15990
> URL: https://issues.apache.org/jira/browse/HDFS-15990
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.10.0
> Reporter: Xuze Yang
> Priority: Minor
>
> In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code exists:
> {code:java}
> if (i % parent.getInodesPerSubSection() == 0) {
>   parent.commitSubSection(headers,
>       FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB);
> }
> {code}
> It aims to serialize SnapshotDiff information into several sub_sections (i.e. additional sub_section information will be written to the FileSummary Section). But FSImageFormatPBSnapshot.loadSnapshotDiffSection() treats the SnapshotDiffSection as a whole, rather than as several sub_sections, so there is no need to introduce sub_sections here.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585507&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585507 ]

ASF GitHub Bot logged work on HDFS-15869:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 20/Apr/21 01:22
Start Date: 20/Apr/21 01:22
Worklog Time Spent: 10m

Work Description: functioner edited a comment on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822806901

> No, it's a non-blocking write so by definition it will never hang – unless induced by fault injection.

@daryn-sharp I have considered this counterargument in https://github.com/apache/hadoop/pull/2737#issuecomment-822591028, where I proposed another argument: it may still hang on a huge payload, because there is a while-loop. Please take a look.

Yes, `channel.write(buffer)` is non-blocking. But I suspect that `channelIO(null, channel, buffer)` is blocking; otherwise we wouldn't have the performance issue in [HDFS-15486](https://issues.apache.org/jira/browse/HDFS-15486). Within `channelIO(null, channel, buffer)`, the large payload is split into multiple parts, and the loop does not exit until the remaining part of the payload no longer exceeds the buffer limit, meaning the thread waits for the network to finish sending some of the content.

I'm not sure whether this fully defends my argument. Can you provide more explanation? Maybe I'm not correct. Thanks!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 585507)
Time Spent: 6.5h  (was: 6h 20m)

> Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
> ----------------------------------------------------------------------------------------------------
>
> Key: HDFS-15869
> URL: https://issues.apache.org/jira/browse/HDFS-15869
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: fs async, namenode
> Affects Versions: 3.2.2
> Reporter: Haoze Wu
> Assignee: Haoze Wu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 6.5h
> Remaining Estimate: 0h
>
> We were doing some testing of the latest Hadoop stable release 3.2.2 and found that a network issue can cause the namenode to hang even with the async edit logging (FSEditLogAsync).
> The workflow of the FSEditLogAsync thread is basically:
> # get EditLog from a queue (line 229)
> # do the transaction (line 232)
> # sync the log if doSync (line 243)
> # do logSyncNotify (line 248)
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();                      // line 229
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();                      // line 232
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());              // line 243
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);                 // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
> {code}
> In terms of step 4, FSEditLogAsync$RpcEdit.logSyncNotify is essentially doing some network write (line 365).
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();                          // line 365
>         } else {
>
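The behavior under debate in this thread, a channel write that is non-blocking per call yet can still stall the calling thread when the payload outruns the socket's send buffer, can be sketched with plain NIO. This is an illustration of the loop pattern being discussed, not Hadoop's actual `channelIO` implementation:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Illustration of the pattern argued above: each write() returns immediately,
// but a caller that loops until the buffer drains effectively progresses at
// the speed of the network. If the peer stops reading, write() on a
// non-blocking channel keeps returning 0 and the thread is pinned here --
// the hang scenario the commenter describes for a huge payload.
public class NonBlockingWriteLoop {
    public static void writeFully(WritableByteChannel ch, ByteBuffer buf)
            throws IOException {
        while (buf.hasRemaining()) {
            int n = ch.write(buf);  // non-blocking channels may write 0 bytes
            if (n == 0) {
                // Send buffer full: progress now depends entirely on the
                // remote side draining the connection.
                Thread.yield();
            }
        }
    }
}
```

On a healthy connection this loop exits quickly; with a slow or stalled peer it does not, which is why a "non-blocking" write API can still leave the single FSEditLogAsync thread stuck in `logSyncNotify`.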
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585505=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585505 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 20/Apr/21 01:18 Start Date: 20/Apr/21 01:18 Worklog Time Spent: 10m Work Description: functioner edited a comment on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822806901 > No, it's a non-blocking write so by definition it will never hang – unless induced by fault injection. @daryn-sharp I have considered this counterargument in https://github.com/apache/hadoop/pull/2737#issuecomment-822591028, where I proposed another argument that it may hang when it's a huge payload, because there's a while-loop. Please take a look. Yes, `channel.write(buffer)` is non-blocking. But I suspect that `channelIO(null, channel, buffer)` is blocking, otherwise we won't have the performance issue in [HDFS-15486](https://issues.apache.org/jira/browse/HDFS-15486). Within `channelIO(null, channel, buffer)`, the large payload is split into multiple parts, and only the last call is non-blocking due to the buffer limit. I'm not sure whether it can defend my argument. Can you provide more explanation? Maybe I'm not correct. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585505) Time Spent: 6h 20m (was: 6h 10m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 6h 20m > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found that a network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). > The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. 
> RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); // line 243 > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx); // line 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, exiting"); > } catch (Throwable t) { > terminate(t); > } > } > {code} > In step 4, FSEditLogAsync$RpcEdit.logSyncNotify is > essentially doing some network write (line 365). > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { > call.sendResponse(); // line 365 > } else { > call.abortResponse(syncEx); > } > } catch (Exception e) {} // don't care if not sent. > } > // ... >
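functioner's point above — that each individual write() can be non-blocking while the surrounding while-loop still keeps the caller busy for a huge payload — can be illustrated with a small standalone simulation. This is not Hadoop code: the `write` helper and the 8 KiB per-call limit are arbitrary stand-ins for a socket whose send buffer drains slowly.

```java
import java.nio.ByteBuffer;

public class PartialWriteLoop {
    // Accepts at most 'limit' bytes per call and returns immediately,
    // mimicking a single non-blocking SocketChannel.write().
    static int write(ByteBuffer buf, int limit) {
        int n = Math.min(buf.remaining(), limit);
        buf.position(buf.position() + n);
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer payload = ByteBuffer.allocate(1 << 20); // 1 MiB response
        int calls = 0;
        // The caller loops until the whole payload is flushed; this loop is
        // where the calling thread spends its time for a large payload.
        while (payload.hasRemaining()) {
            write(payload, 8192);
            calls++;
        }
        System.out.println(calls); // prints 128 (1 MiB / 8 KiB)
    }
}
```

Each write() returns immediately, yet the thread cannot leave the loop until all 128 chunks are accepted; with a real socket and a stalled peer, write() would start returning 0 and the thread would spin (or need selector-based waiting) instead of moving on to the next edit.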
[jira] [Commented] (HDFS-15614) Initialize snapshot trash root during NameNode startup if enabled
[ https://issues.apache.org/jira/browse/HDFS-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325352#comment-17325352 ] Arpit Agarwal commented on HDFS-15614: -- bq. if providing an external command for admins to create the Trash directory is feasible and makes sense An external command would add more friction to enabling the feature. We want it to be as transparent as possible. I like the option to auto-create the .Trash dir better. > Initialize snapshot trash root during NameNode startup if enabled > - > > Key: HDFS-15614 > URL: https://issues.apache.org/jira/browse/HDFS-15614 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > This is a follow-up to HDFS-15607. > Goal: > Initialize (create) the snapshot trash root for all existing snapshottable > directories if {{dfs.namenode.snapshot.trashroot.enabled}} is set to > {{true}}, so admins won't have to run {{dfsadmin -provisionTrash}} manually > on all those existing snapshottable directories. > The change is expected to land in {{FSNamesystem}}. > Discussion: > 1. Currently in HDFS-15607, the snapshot trash root creation logic is on the > client side. But in order for the NN to create it at startup, the logic must > also be implemented on the server side -- which is also a > requirement of WebHDFS (HDFS-15612). > 2. Alternatively, we can provide an extra parameter to the > {{-provisionTrash}} command, like {{dfsadmin -provisionTrash -all}}, to > initialize/provision the trash root on all existing snapshottable dirs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
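The auto-create option discussed above could be sketched roughly as follows. This is an illustrative sketch only: it uses java.nio.file on a local filesystem, and the class and method names (`ProvisionTrashSketch`, `provisionAll`) are hypothetical; the real change would go through FSNamesystem and HDFS paths.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ProvisionTrashSketch {
    static final String TRASH = ".Trash";

    // At startup, create <dir>/.Trash under every snapshottable directory
    // that does not have one yet, so admins need not run
    // `dfsadmin -provisionTrash` by hand. Idempotent: a second run creates
    // nothing. Returns how many trash roots were created.
    static int provisionAll(List<Path> snapshottableDirs) throws IOException {
        int created = 0;
        for (Path dir : snapshottableDirs) {
            Path trash = dir.resolve(TRASH);
            if (Files.notExists(trash)) {
                Files.createDirectory(trash);
                created++;
            }
        }
        return created;
    }
}
```

Idempotence is the property that makes startup-time provisioning safe to repeat on every NameNode restart.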
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=585413=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585413 ] ASF GitHub Bot logged work on HDFS-15989: - Author: ASF GitHub Bot Created on: 19/Apr/21 22:26 Start Date: 19/Apr/21 22:26 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2923: URL: https://github.com/apache/hadoop/pull/2923#issuecomment-822827055 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 54s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 39m 10s | | trunk passed | | +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 1m 19s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 7s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 31s | | trunk passed | | +1 :green_heart: | javadoc | 0m 59s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 31s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 24s | | trunk passed | | +1 :green_heart: | shadedclient | 17m 5s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 17s | | the patch passed | | +1 :green_heart: | compile | 1m 23s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 1m 23s | | hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 generated 0 new + 469 unchanged - 1 fixed = 469 total (was 470) | | +1 :green_heart: | compile | 1m 11s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 1m 11s | | hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 generated 0 new + 453 unchanged - 1 fixed = 453 total (was 454) | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 56s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2923/11/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 7 new + 158 unchanged - 49 fixed = 165 total (was 207) | | +1 :green_heart: | mvnsite | 1m 16s | | the patch passed | | +1 :green_heart: | javadoc | 0m 50s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 20s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 24s | | the patch passed | | +1 :green_heart: | shadedclient | 16m 50s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 389m 18s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2923/11/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. 
| | +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. | | | | 484m 41s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.web.TestWebHdfsFileSystemContract | | | hadoop.hdfs.TestPersistBlocks | | | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor | | | hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand | | | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys | | | hadoop.hdfs.TestDFSShell | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby | | | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | | | hadoop.hdfs.server.namenode.TestFileTruncate | | |
[jira] [Work logged] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang
[ https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=585412=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585412 ] ASF GitHub Bot logged work on HDFS-15957: - Author: ASF GitHub Bot Created on: 19/Apr/21 22:24 Start Date: 19/Apr/21 22:24 Worklog Time Spent: 10m Work Description: functioner commented on a change in pull request #2878: URL: https://github.com/apache/hadoop/pull/2878#discussion_r616213541 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java ## @@ -378,13 +381,18 @@ public void logSyncWait() { @Override public void logSyncNotify(RuntimeException syncEx) { - try { -if (syncEx == null) { - call.sendResponse(); -} else { - call.abortResponse(syncEx); + for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) { Review comment: @daryn-sharp Thanks for your explanation! Now I understand that: 1. If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response. 2. If a speculated future bug left the connection open it's in an unknown/inconsistent state with possible partial data written so writing anything more is a corrupted or duplicate response for the client. And I have more questions: 1. What kind of messages will be sent in this specific `call.sendResponse()`? 2. Among these messages, are there critical ones? I mean, maybe some messages are important to the protocol and we don't want to lose them. If we delegate them to a dedicated RPC thread/service/framework, the disconnection/exception can be automatically handled so that those important messages can reliably reach the receiver's side. The `call.sendResponse()` is invoked after the edit log sync finishes, so the transaction has been done. However, it swallows the possible exception so it doesn't care whether it successfully replies (to datanode/client). 
Even if disconnection is implied in this scenario, I was wondering whether a bug could be hidden by this silently ignored message. For example, if the receiver (datanode/client) does some end-to-end retry (e.g., it gets no reply and requests the same transaction again), then this identical transaction may be rejected because it is already committed in the edit log. Probably we should at least add some warning/logging when `call.sendResponse()` fails to send the message. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585412) Time Spent: 1.5h (was: 1h 20m) > The ignored IOException in the RPC response sent by FSEditLogAsync can cause > the HDFS client to hang > > > Key: HDFS-15957 > URL: https://issues.apache.org/jira/browse/HDFS-15957 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Priority: Critical > Labels: pull-request-available > Attachments: fsshell.txt, namenode.txt, reproduce.patch, > secondnamenode.txt > > Time Spent: 1.5h > Remaining Estimate: 0h > > In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, > because the possible exception (e.g., IOException) thrown in line 365 is > always ignored. > > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > class FSEditLogAsync extends FSEditLog implements Runnable { > // ... > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. 
> doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. > RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { >
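The warning/logging suggestion above could look like the following sketch. The `Call` interface and the in-memory `warnings` list are illustrative stand-ins for Hadoop's `Server.Call` and `LOG` — not the actual APIs — chosen so the behavior is observable in isolation.

```java
import java.util.ArrayList;
import java.util.List;

public class LoggingNotifySketch {
    // Stand-in for org.apache.hadoop.ipc.Server.Call (illustrative only).
    interface Call {
        void sendResponse() throws Exception;
    }

    // Stand-in for LOG.warn: records why a response was dropped.
    final List<String> warnings = new ArrayList<>();

    // Instead of `catch (Exception e) {}`, record the failure so a silently
    // dropped reply leaves a trace when debugging end-to-end retries.
    void logSyncNotify(Call call) {
        try {
            call.sendResponse();
        } catch (Exception e) {
            warnings.add("Failed to send RPC response: " + e.getMessage());
        }
    }
}
```

The point is not to retry (the connection state is unknown after a failure) but simply to stop swallowing the evidence.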
[jira] [Work logged] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang
[ https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=585405=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585405 ] ASF GitHub Bot logged work on HDFS-15957: - Author: ASF GitHub Bot Created on: 19/Apr/21 22:17 Start Date: 19/Apr/21 22:17 Worklog Time Spent: 10m Work Description: functioner commented on a change in pull request #2878: URL: https://github.com/apache/hadoop/pull/2878#discussion_r616213541 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java ## @@ -378,13 +381,18 @@ public void logSyncWait() { @Override public void logSyncNotify(RuntimeException syncEx) { - try { -if (syncEx == null) { - call.sendResponse(); -} else { - call.abortResponse(syncEx); + for (int retries = 0; retries <= RESPONSE_SEND_RETRIES; retries++) { Review comment: @daryn-sharp Thanks for your explanation! Now I understand that: 1. If the response cannot be sent it's either because the connection is already closed or there's a bug preventing the encoding of the response. 2. If a speculated future bug left the connection open it's in an unknown/inconsistent state with possible partial data written so writing anything more is a corrupted or duplicate response for the client. And I have more questions: 1. What kind of messages will be sent in this specific `call.sendResponse()`? 2. Among these messages, are there critical ones? I mean, maybe some messages are important to the protocol and we don't want to lose them. If we delegate them to a dedicated RPC thread/service/framework, the disconnection/exception can be automatically handled so that those important messages can reliably reach the receiver's side. The `call.sendResponse()` is invoked after the edit log sync finishes, so the transaction has been done. However, it swallows the possible exception so it doesn't care whether it successfully replies (to datanode/client). 
Even if disconnection is implied in this scenario, I was wondering whether a bug could be hidden by this silently ignored message. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585405) Time Spent: 1h 20m (was: 1h 10m) > The ignored IOException in the RPC response sent by FSEditLogAsync can cause > the HDFS client to hang > > > Key: HDFS-15957 > URL: https://issues.apache.org/jira/browse/HDFS-15957 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Priority: Critical > Labels: pull-request-available > Attachments: fsshell.txt, namenode.txt, reproduce.patch, > secondnamenode.txt > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, > because the possible exception (e.g., IOException) thrown in line 365 is > always ignored. > > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > class FSEditLogAsync extends FSEditLog implements Runnable { > // ... > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. 
> RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx); // line > 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > // the calling rpc thread will
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585385=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585385 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 21:45 Start Date: 19/Apr/21 21:45 Worklog Time Spent: 10m Work Description: functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822806901 > No, it's a non-blocking write so by definition it will never hang – unless induced by fault injection. @daryn-sharp I have considered this counterargument in https://github.com/apache/hadoop/pull/2737#issuecomment-822591028, where I proposed another argument that it may hang when it's a huge payload, because there's a while-loop. Please take a look. I'm not sure whether it can defend my argument. Can you provide more explanation? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585385) Time Spent: 6h 10m (was: 6h) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 6h 10m > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). 
> The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. > RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); // > line 243 > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx);// > line 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > {code} > In terms of the step 4, FSEditLogAsync$RpcEdit.logSyncNotify is > essentially doing some network write (line 365). > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { > call.sendResponse(); // line > 365 > } else { > call.abortResponse(syncEx); > } > } catch (Exception e) {} // don't care if not sent. > } > // ... > }{code} > If the sendResponse operation in line 365 gets stuck, then the whole > FSEditLogAsync thread is not able to proceed. 
In this case, the critical > logSync (line 243) can’t be executed for the incoming transactions, and the > namenode hangs. This is undesirable because FSEditLogAsync’s key feature is > asynchronous edit logging, which is supposed to tolerate slow I/O. > To see why the sendResponse operation
[jira] [Commented] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325331#comment-17325331 ] Daryn Sharp commented on HDFS-15869: This is again a fault injection only issue. A non-blocking write will by definition not block. The cited ZKs issues appear to be regarding serialization + blocking write inside of a synchronized section. > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 6h > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). > The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. 
> RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); // > line 243 > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx);// > line 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > {code} > In terms of the step 4, FSEditLogAsync$RpcEdit.logSyncNotify is > essentially doing some network write (line 365). > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { > call.sendResponse(); // line > 365 > } else { > call.abortResponse(syncEx); > } > } catch (Exception e) {} // don't care if not sent. > } > // ... > }{code} > If the sendResponse operation in line 365 gets stuck, then the whole > FSEditLogAsync thread is not able to proceed. In this case, the critical > logSync (line 243) can’t be executed, for the incoming transactions. Then the > namenode hangs. This is undesirable because FSEditLogAsync’s key feature is > asynchronous edit logging that is supposed to tolerate slow I/O. 
> To see why the sendResponse operation in line 365 may get stuck, here is > the stack trace: > {code:java} > '(org.apache.hadoop.ipc.Server,channelWrite,3593)', > '(org.apache.hadoop.ipc.Server,access$1700,139)', > '(org.apache.hadoop.ipc.Server$Responder,processResponse,1657)', > '(org.apache.hadoop.ipc.Server$Responder,doRespond,1727)', > '(org.apache.hadoop.ipc.Server$Connection,sendResponse,2828)', > '(org.apache.hadoop.ipc.Server$Connection,access$300,1799)', > '(org.apache.hadoop.ipc.Server$RpcCall,doResponse,)', > '(org.apache.hadoop.ipc.Server$Call,doResponse,903)', > '(org.apache.hadoop.ipc.Server$Call,sendResponse,889)', > > '(org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$RpcEdit,logSyncNotify,365)', > '(org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync,run,248)', > '(java.lang.Thread,run,748)' > {code} > The `channelWrite` function is defined as follows: > {code:java} >
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585376=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585376 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 21:27 Start Date: 19/Apr/21 21:27 Worklog Time Spent: 10m Work Description: daryn-sharp commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822797022 bq. the FSEditLogAsync thread (without this patch) directly invokes a network I/O call call.sendResponse(), so if this network I/O invocation hangs, the FSEditLogAsync thread also hangs bq. My intention is to defend that we should not remove the references to "hanging" problems. No, it's a non-blocking write so by definition it will never hang – unless induced by fault injection. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585376) Time Spent: 5h 50m (was: 5h 40m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 5h 50m > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). 
> The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. > RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); // > line 243 > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx);// > line 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > {code} > In terms of the step 4, FSEditLogAsync$RpcEdit.logSyncNotify is > essentially doing some network write (line 365). > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { > call.sendResponse(); // line > 365 > } else { > call.abortResponse(syncEx); > } > } catch (Exception e) {} // don't care if not sent. > } > // ... > }{code} > If the sendResponse operation in line 365 gets stuck, then the whole > FSEditLogAsync thread is not able to proceed. 
In this case, the critical > logSync (line 243) can’t be executed for the incoming transactions, and the > namenode hangs. This is undesirable because FSEditLogAsync’s key feature is > asynchronous edit logging, which is supposed to tolerate slow I/O. > To see why the sendResponse operation in line 365 may get stuck, here is > the stack trace: >
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585377=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585377 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 21:27 Start Date: 19/Apr/21 21:27 Worklog Time Spent: 10m Work Description: daryn-sharp edited a comment on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822797022 > the FSEditLogAsync thread (without this patch) directly invokes a network I/O call call.sendResponse(), so if this network I/O invocation hangs, the FSEditLogAsync thread also hangs > My intention is to defend that we should not remove the references to "hanging" problems. No, it's a non-blocking write so by definition it will never hang – unless induced by fault injection. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585377) Time Spent: 6h (was: 5h 50m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 6h > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). 
> The workflow of the FSEditLogAsync thread is basically:
> # get EditLog from a queue (line 229)
> # do the transaction (line 232)
> # sync the log if doSync (line 243)
> # do logSyncNotify (line 248)
> {code:java}
> // hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> @Override
> public void run() {
>   try {
>     while (true) {
>       boolean doSync;
>       Edit edit = dequeueEdit();                          // line 229
>       if (edit != null) {
>         // sync if requested by edit log.
>         doSync = edit.logEdit();                          // line 232
>         syncWaitQ.add(edit);
>       } else {
>         // sync when editq runs dry, but have edits pending a sync.
>         doSync = !syncWaitQ.isEmpty();
>       }
>       if (doSync) {
>         // normally edit log exceptions cause the NN to terminate, but tests
>         // relying on ExitUtil.terminate need to see the exception.
>         RuntimeException syncEx = null;
>         try {
>           logSync(getLastWrittenTxId());                  // line 243
>         } catch (RuntimeException ex) {
>           syncEx = ex;
>         }
>         while ((edit = syncWaitQ.poll()) != null) {
>           edit.logSyncNotify(syncEx);                     // line 248
>         }
>       }
>     }
>   } catch (InterruptedException ie) {
>     LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>   } catch (Throwable t) {
>     terminate(t);
>   }
> }
> {code}
> In terms of step 4, FSEditLogAsync$RpcEdit.logSyncNotify is essentially doing some network write (line 365).
> {code:java}
> // hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> private static class RpcEdit extends Edit {
>   // ...
>   @Override
>   public void logSyncNotify(RuntimeException syncEx) {
>     try {
>       if (syncEx == null) {
>         call.sendResponse();                              // line 365
>       } else {
>         call.abortResponse(syncEx);
>       }
>     } catch (Exception e) {} // don't care if not sent.
>   }
>   // ...
> }
> {code}
> If the sendResponse operation in line 365 gets stuck, then the whole FSEditLogAsync thread is not able to proceed.
In this case, the critical logSync (line 243) can’t be executed for the incoming transactions. Then the namenode hangs. This is undesirable because FSEditLogAsync’s key feature is asynchronous edit logging that is supposed to tolerate slow I/O.
> To see why the sendResponse operation in line 365 may get stuck, here is the stack trace:
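The remedy discussed in this thread is to take the client notification off the edit-sync loop so that a stuck sendResponse cannot stall logSync. Below is a minimal illustrative sketch of that decoupling, not the actual HDFS patch: all names (AsyncNotifySketch, drainAll, the notifier thread) are hypothetical, and a plain Runnable stands in for call.sendResponse().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a dedicated notifier thread drains a queue of
// notifications, so the loop that calls logSync never blocks on network I/O.
public class AsyncNotifySketch {

    // Enqueue n notifications, wait until the notifier thread has delivered
    // them all, and return the count actually delivered.
    public static int drainAll(int n) throws InterruptedException {
        BlockingQueue<Runnable> notifyQ = new LinkedBlockingQueue<>();
        AtomicInteger sent = new AtomicInteger();

        Thread notifier = new Thread(() -> {
            try {
                while (true) {
                    notifyQ.take().run();   // stands in for call.sendResponse()
                }
            } catch (InterruptedException ie) {
                // exit on shutdown
            }
        });
        notifier.setDaemon(true);
        notifier.start();

        // The "sync loop" side only enqueues; put() returns immediately even
        // if an individual notification would be slow.
        for (int i = 0; i < n; i++) {
            notifyQ.put(sent::incrementAndGet);
        }
        while (sent.get() < n) {
            Thread.sleep(5);
        }
        notifier.interrupt();
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("delivered=" + drainAll(5));
    }
}
```

Under this structure, a notification that hangs would only delay the notifier thread (and whatever is queued behind it), while logSync for new transactions keeps making progress.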
[jira] [Work logged] (HDFS-15991) Add location into datanode info for NameNodeMXBean
[ https://issues.apache.org/jira/browse/HDFS-15991?focusedWorklogId=585348=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585348 ] ASF GitHub Bot logged work on HDFS-15991: - Author: ASF GitHub Bot Created on: 19/Apr/21 20:23 Start Date: 19/Apr/21 20:23 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2933: URL: https://github.com/apache/hadoop/pull/2933#issuecomment-822760304 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 38s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 34m 23s | | trunk passed | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 1m 17s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 3s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 23s | | trunk passed | | +1 :green_heart: | javadoc | 0m 53s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 27s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 9s | | trunk passed | | +1 :green_heart: | shadedclient | 16m 16s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 12s | | the patch passed | | +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 1m 12s | | the patch passed | | +1 :green_heart: | compile | 1m 6s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 1m 6s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 53s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 11s | | the patch passed | | +1 :green_heart: | javadoc | 0m 44s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 20s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 7s | | the patch passed | | +1 :green_heart: | shadedclient | 15m 59s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 230m 18s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2933/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 43s | | The patch does not generate ASF License warnings. 
| | | | 317m 36s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2933/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2933 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell | | uname | Linux 90e25c73c94f 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 31726f45042950c75bef119d72c4f13283453f15 | | Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | Multi-JDK versions |
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585345=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585345 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 19:51 Start Date: 19/Apr/21 19:51 Worklog Time Spent: 10m Work Description: functioner edited a comment on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822713500 > Thanks @functioner > The detailed discussions (except the lambda argument) should have been on the Jira. > > > IMO, this is a classic Producer-Consumer problem, and it is natural idea to improve performance using parallel way. > > > So, call.sendResponse() (network service) affects FSEditLogAsync (edit log sync service). So, I would say it's a bug. > > Now, I am really even more confused about the (Bug Vs. Improvement). So, I am going to pass on reviewing. @amahussein Thanks for your feedback, and your time! Sorry for all the possible confusion I made. It's not a big deal whether it's marked as bug or improvement. One of my bug reports ([HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)) is also finally marked as improvement rather than bug. The point is that the developers (in [HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)) finally realized that there's a potential hanging issue as I point out, and the patch (as well as the relevant discussion) is very helpful for the developers and the users. > Ok, is the purpose of the change is to improve performance of the FSEditLogAsync.java by executing sendResponse() in parallel? > In that case, please change the title of the Jira and the description to remove references to "hanging" problems. My intention is to defend that we should not remove the references to "hanging" problems. In short, the discussion above can be summarized into 3 arguments: 1. 
https://github.com/apache/hadoop/pull/2737#issuecomment-822151838: this is a classic Producer-Consumer problem, and it is natural idea to improve performance using parallel way 2. https://github.com/apache/hadoop/pull/2737#issuecomment-822591028: the `call.sendResponse()` may hang due to network issue, without throwing any exception 3. https://github.com/apache/hadoop/pull/2737#issuecomment-822617097: a). the `FSEditLogAsync` thread (without this patch) directly invokes a network I/O call `call.sendResponse()`, so if this network I/O invocation hangs, the `FSEditLogAsync` thread also hangs b). in the "correct" system design, if this network I/O invocation hangs in this way, then that should be fine, because HDFS (as a fault-tolerant system) should tolerate it. c). when the system tolerates this network issue, the `FSEditLogAsync` thread should not hang, otherwise everybody can't commit the log. d). our expected behavior is that, when the system tolerates this network issue, the `FSEditLogAsync` thread should continue, so that everything still works well, despite this network issue. Both Argument 3 and Argument 1 can be resolved with this patch. In conclusion, this patch not only improves the performance, but also enhances the availability & fault-tolerance. So, I think the references to "hanging" problems should not be removed. If it keeps "Improvement" tag instead of "Bug" tag, that's fine. P.S. I will summarize our discussion with a comment in Jira after we reach a consensus. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585345) Time Spent: 5h 40m (was: 5.5h) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 5h 40m > Remaining Estimate: 0h
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585330=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585330 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 19:14 Start Date: 19/Apr/21 19:14 Worklog Time Spent: 10m Work Description: functioner edited a comment on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822713500 > Thanks @functioner > The detailed discussions (except the lambda argument) should have been on the Jira. > > > IMO, this is a classic Producer-Consumer problem, and it is natural idea to improve performance using parallel way. > > > So, call.sendResponse() (network service) affects FSEditLogAsync (edit log sync service). So, I would say it's a bug. > > Now, I am really even more confused about the (Bug Vs. Improvement). So, I am going to pass on reviewing. @amahussein Thanks for your feedback, and your time! Sorry for all the possible confusion I made. It's not a big deal whether it's marked as bug or improvement. One of my bug reports ([HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)) is also finally marked as improvement rather than bug. The point is that the developers (in [HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)) finally realized that there's a potential hanging issue as I point out, and the patch (as well as the relevant discussion) is very helpful for the developers and the users. > Ok, is the purpose of the change is to improve performance of the FSEditLogAsync.java by executing sendResponse() in parallel? > In that case, please change the title of the Jira and the description to remove references to "hanging" problems. My intention is to defend that we should not remove the references to "hanging" problems. In short, the discussion above can be summarized into 3 arguments: 1. 
https://github.com/apache/hadoop/pull/2737#issuecomment-822151838: this is a classic Producer-Consumer problem, and it is natural idea to improve performance using parallel way 2. https://github.com/apache/hadoop/pull/2737#issuecomment-822591028: the `call.sendResponse()` may hang due to network issue, without throwing any exception 3. https://github.com/apache/hadoop/pull/2737#issuecomment-822617097: a). the `FSEditLogAsync` thread (without this patch) directly invokes a network I/O call `call.sendResponse()`, so if this network I/O invocation hangs, the `FSEditLogAsync` thread also hangs b). in the "correct" system design, if this network I/O invocation hangs in this way, then that should be fine, because HDFS (as a fault-tolerant system) should tolerate it. c). when the system tolerates this network issue, the `FSEditLogAsync` thread should not hang, otherwise everybody can't commit the log. d). our expected behavior is that, when the system tolerates this network issue, the `FSEditLogAsync` thread should continue, so that everything still works well, despite this network issue. Both Argument 3 and Argument 1 can be resolved with this patch. In conclusion, this patch not only improves the performance, but also enhances the availability & fault-tolerance. So, I think the references to "hanging" problems should not be removed. If it keeps "Improvement" tag instead of "Bug" tag, I won't disagree with it. P.S. I will summarize our discussion with a comment in Jira after we reach a consensus. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585330) Time Spent: 5.5h (was: 5h 20m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 5.5h > Remaining Estimate: 0h
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585328=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585328 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 19:07 Start Date: 19/Apr/21 19:07 Worklog Time Spent: 10m Work Description: functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822713500 > Thanks @functioner > The detailed discussions (except the lambda argument) should have been on the Jira. > > > IMO, this is a classic Producer-Consumer problem, and it is natural idea to improve performance using parallel way. > > > So, call.sendResponse() (network service) affects FSEditLogAsync (edit log sync service). So, I would say it's a bug. > > Now, I am really even more confused about the (Bug Vs. Improvement). So, I am going to pass on reviewing. @amahussein Thanks for your feedback, and your time! Sorry for all the possible confusion I made. It's not a big deal whether it's marked as bug or improvement. One of my bug reports ([HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)) is also finally marked as improvement rather than bug. The point is that the developers (in [HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)) finally realized that there's a potential hanging issue as I point out, and the patch (as well as the relevant discussion) is very helpful for the developers and the users. > Ok, is the purpose of the change is to improve performance of the FSEditLogAsync.java by executing sendResponse() in parallel? > In that case, please change the title of the Jira and the description to remove references to "hanging" problems. My intention is to defend that we should not remove the references to "hanging" problems. In short, the discussion above can be summarized into 3 arguments: 1. 
https://github.com/apache/hadoop/pull/2737#issuecomment-822151838: this is a classic Producer-Consumer problem, and it is natural idea to improve performance using parallel way 2. https://github.com/apache/hadoop/pull/2737#issuecomment-822591028: the `call.sendResponse()` may hang due to network issue, without throwing any exception 3. https://github.com/apache/hadoop/pull/2737#issuecomment-822617097: a). the `FSEditLogAsync` thread (without this patch) directly invokes a network I/O call `call.sendResponse()`, so if this network I/O invocation hangs, the `FSEditLogAsync` thread also hangs b). in the "correct" system design, if this network I/O invocation hangs in this way, then that should be fine, because HDFS (as a fault-tolerant system) should tolerate it. c). when the system tolerates this network issue, the `FSEditLogAsync` thread should not hang, otherwise everybody can't commit the log. d). our expected behavior is that, when the system tolerates this network issue, the `FSEditLogAsync` thread should continue, so that everything still works well, despite this network issue. Both Argument 3 and Argument 1 can be resolved with this patch. In conclusion, this patch not only improves the performance, but also enhances the availability & fault-tolerance. So, I think the references to "hanging" problems should not be removed. P.S. I will summarize our discussion with a comment in Jira after we reach a consensus. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585328) Time Spent: 5h 20m (was: 5h 10m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 5h 20m > Remaining Estimate: 0h
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585277=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585277 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 17:36 Start Date: 19/Apr/21 17:36 Worklog Time Spent: 10m Work Description: functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822649368 > Thanks @functioner > The detailed discussions (except the lambda argument) should have been on the Jira. I see. I will make a comment of summary in Jira after the discussion in this PR is finalized. > > > If any concerns about lambda expression, we could improve it rather than reject it directly. > > > > > > @amahussein A common way to eliminate such overhead is preparing multiple consumer threads, and feed them with requests. > > If the lambda expressions cause significant overhead, we can improve in that way. > > This design pattern is widely used in Cassandra. Example: SEPWorker - SEPExecutor > > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPWorker.java > > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java > > This is not what I meant. It is recommended to avoid use of lambda expressions in hot execution paths. Actually we are on the same page. Maybe my comment has made some confusion. > There are so many ways to avoid lambda expressions simply by having runnables waiting for tasks to be added to a queue. That's exactly what I meant. I will push a commit soon. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585277) Time Spent: 5h 10m (was: 5h) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 5h 10m > Remaining Estimate: 0h
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585274=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585274 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 17:23 Start Date: 19/Apr/21 17:23 Worklog Time Spent: 10m Work Description: amahussein edited a comment on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822639062 Thanks @functioner The detailed discussions (except the lambda argument) should have been on the Jira. > IMO, this is a classic Producer-Consumer problem, and it is natural idea to improve performance using parallel way. > So, call.sendResponse() (network service) affects FSEditLogAsync (edit log sync service). So, I would say it's a bug. Now, I am really even more confused about the (Bug Vs. Improvement). So, I am going to pass on reviewing. > > If any concerns about lambda expression, we could improve it rather than reject it directly. > > @amahussein A common way to eliminate such overhead is preparing multiple consumer threads, and feed them with requests. > If the lambda expressions cause significant overhead, we can improve in that way. > This design pattern is widely used in Cassandra. Example: SEPWorker - SEPExecutor > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPWorker.java > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java This is not what I meant. It is recommended to avoid use of lambda expressions in hot execution paths. There are so many ways to avoid lambda expressions simply by having runnables waiting for tasks to be added to a queue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585274) Time Spent: 5h (was: 4h 50m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 5h > Remaining Estimate: 0h
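The review comments above suggest avoiding per-call lambdas on the hot path by having long-lived runnables wait on a queue of task objects. The sketch below illustrates that pattern under assumed names (QueueWorkerSketch, SendTask, Worker are all hypothetical; the counter increment stands in for call.sendResponse()):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the "runnables waiting on a queue" pattern: one long-lived worker
// consumes plain task objects, so no capturing lambda is allocated per call.
public class QueueWorkerSketch {

    // Plain task object enqueued on the hot path.
    static final class SendTask {
        final int callId;
        SendTask(int callId) { this.callId = callId; }
    }

    static final SendTask POISON = new SendTask(-1);  // shutdown marker

    // The worker is a named Runnable created once, not per task.
    static final class Worker implements Runnable {
        private final BlockingQueue<SendTask> q;
        private final AtomicInteger done;
        Worker(BlockingQueue<SendTask> q, AtomicInteger done) {
            this.q = q;
            this.done = done;
        }
        @Override
        public void run() {
            try {
                while (true) {
                    SendTask t = q.take();
                    if (t == POISON) {
                        return;
                    }
                    done.incrementAndGet();  // stands in for call.sendResponse()
                }
            } catch (InterruptedException ie) {
                // exit on shutdown
            }
        }
    }

    public static int processAll(int n) throws InterruptedException {
        BlockingQueue<SendTask> q = new ArrayBlockingQueue<>(64);
        AtomicInteger done = new AtomicInteger();
        Thread worker = new Thread(new Worker(q, done));
        worker.start();
        for (int i = 0; i < n; i++) {
            q.put(new SendTask(i));
        }
        q.put(POISON);
        worker.join();
        return done.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("processed=" + processAll(4));
    }
}
```

The design choice here mirrors the reviewer's point: the allocation cost moves from one lambda per response to one small task object per response, and the worker object itself is reused for the lifetime of the thread.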
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585271=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585271 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 17:20 Start Date: 19/Apr/21 17:20 Worklog Time Spent: 10m Work Description: amahussein commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822639062 Thanks @functioner The detailed discussions (except the lambda argument) should have been on the Jira. Now, I am really even more confused about the (Bug Vs. Improvement). So, I am going to pass on reviewing. > > If any concerns about lambda expression, we could improve it rather than reject it directly. > > @amahussein A common way to eliminate such overhead is preparing multiple consumer threads, and feed them with requests. > If the lambda expressions cause significant overhead, we can improve in that way. > This design pattern is widely used in Cassandra. Example: SEPWorker - SEPExecutor > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPWorker.java > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java This is not what I meant. It is recommended to avoid use of lambda expressions in hot execution paths. There are so many ways to avoid lambda expressions simply by having runnables waiting for tasks to be added to a queue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585271) Time Spent: 4h 50m (was: 4h 40m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 4h 50m > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). > The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. 
> RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); // > line 243 > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx);// > line 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > {code} > In terms of the step 4, FSEditLogAsync$RpcEdit.logSyncNotify is > essentially doing some network write (line 365). > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { >
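The decoupling under discussion in the PR — moving the `sendResponse()` network write out of the `FSEditLogAsync` loop — can be sketched as below. This is a minimal illustration of the idea, not the actual HDFS-15869 patch; `AsyncNotifier` and `Notify` are hypothetical names standing in for `RpcEdit.logSyncNotify` and the RPC call object.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-in for the response write that RpcEdit.logSyncNotify performs.
interface Notify {
    void sendResponse() throws Exception;
}

// A small pool absorbs slow or stuck network writes so the edit-log
// sync loop can keep draining its queue instead of blocking in step 4.
class AsyncNotifier {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    void logSyncNotify(Notify call) {
        pool.execute(() -> {
            try {
                call.sendResponse();
            } catch (Exception e) {
                // mirror the original code: don't care if not sent
            }
        });
    }

    boolean shutdownAndWait() throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

With this shape, a hung `sendResponse()` ties up one pool thread rather than the thread that executes the critical `logSync`.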
[jira] [Updated] (HDFS-15561) RBF: Fix NullPointException when start dfsrouter
[ https://issues.apache.org/jira/browse/HDFS-15561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengnan Li updated HDFS-15561: -- Summary: RBF: Fix NullPointException when start dfsrouter (was: Fix NullPointException when start dfsrouter) > RBF: Fix NullPointException when start dfsrouter > > > Key: HDFS-15561 > URL: https://issues.apache.org/jira/browse/HDFS-15561 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.3.0 >Reporter: Xie Lei >Assignee: Fengnan Li >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > when start dfsrouter, it throw NPE > {code:java} > 2020-09-08 19:41:14,989 ERROR > org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService: > Unexpected exception while communicating with null:null: > java.net.UnknownHostException: null2020-09-08 19:41:14,989 ERROR > org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService: > Unexpected exception while communicating with null:null: > java.net.UnknownHostException: nulljava.lang.IllegalArgumentException: > java.net.UnknownHostException: null at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:447) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:171) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:123) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:95) > at > org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.getNamenodeStatusReport(NamenodeHeartbeatService.java:248) > at > org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:205) > at > org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:159) > at > org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178) > at > 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514) > at > java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) at > java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) > at java.base/java.lang.Thread.run(Thread.java:844)Caused by: > java.net.UnknownHostException: null ... 14 more > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
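The `java.net.UnknownHostException: null` in the trace above arises because an unconfigured namenode RPC address is rendered as the literal string `null:null` before it reaches the resolver. A defensive check along the following lines would fail fast with a clearer message; this is an illustrative sketch, not the actual HDFS-15561 fix, and `NamenodeAddressCheck` is a hypothetical name.

```java
// Hypothetical guard: reject a missing/unresolvable namenode RPC address
// up front, instead of letting the literal host "null" surface later as
// java.net.UnknownHostException: null from SecurityUtil.buildTokenService.
class NamenodeAddressCheck {
    static String requireAddress(String nsId, String nnId, String address) {
        if (address == null || address.startsWith("null")) {
            throw new IllegalStateException(
                "No RPC address configured for namenode " + nnId
                    + " of nameservice " + nsId
                    + " (check dfs.namenode.rpc-address.<ns>.<nn>)");
        }
        return address;
    }
}
```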
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585241=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585241 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 16:46 Start Date: 19/Apr/21 16:46 Worklog Time Spent: 10m Work Description: functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822617097 > > In that case, please change the title of the Jira and the description to remove references to "hanging" problems. > > @amahussein I still would like to argue about this "hanging" issue. Another aspect of the argument is the design of availability and fault tolerance. Actually distributed systems can tolerate such hanging issues in many scenarios, but sometimes it's seen as a bug like [ZOOKEEPER-2201](https://issues.apache.org/jira/browse/ZOOKEEPER-2201). So an important question is: when it's a bug; and when it's not (i.e., it's a feature) I've been doing research on fault injection for some time and I have submitted multiple bug reports accepted by the open source community (e.g., [HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)). My criteria for determining whether it is bug, are: 1. if we inject a fault in **module X** and it only affects **module X**, then it’s not a bug. 2. if we inject a fault in **module X** and it affects not only **module X** but also **module Y** which should not relate to **module X**, then probably it would be a bug, because in the system design, each module should be responsible for itself and report the problem (e.g., by logging), rather than affect another irrelevant module. In our scenario ([HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869)), this possible hanging (_if you agree with my argument of network hanging_) can block the `FSEditLogAsync` thread, because now `call.sendResponse()` is invoked by the `FSEditLogAsync` thread. 
So, `call.sendResponse()` (network service) affects `FSEditLogAsync` (edit log sync service). So, I would say it's a bug. The network service should be responsible for all its behaviors, and handle all the possible network issues (e.g., IOException, disconnection, hanging). It should determine how to handle them, e.g., by logging the error, rather than affecting other services like `FSEditLogAsync`. I'm not saying that we have to use a complete and slow RPC framework for this network service. But IMO, decoupling it from `FSEditLogAsync` by delegating to a thread pool is at least a better design. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585241) Time Spent: 4h 40m (was: 4.5h) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 4h 40m > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). 
> The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need
[jira] [Work logged] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds
[ https://issues.apache.org/jira/browse/HDFS-15810?focusedWorklogId=585234=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585234 ] ASF GitHub Bot logged work on HDFS-15810: - Author: ASF GitHub Bot Created on: 19/Apr/21 16:35 Start Date: 19/Apr/21 16:35 Worklog Time Spent: 10m Work Description: goiri commented on a change in pull request #2910: URL: https://github.com/apache/hadoop/pull/2910#discussion_r616004081 ## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/metrics/TestRBFMetrics.java ## @@ -18,10 +18,7 @@ package org.apache.hadoop.hdfs.server.federation.metrics; import static org.apache.hadoop.hdfs.server.federation.FederationTestUtils.getBean; -import static org.junit.Assert.assertEquals; Review comment: Keep extended. ## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/metrics/TestRBFMetrics.java ## @@ -351,21 +349,32 @@ private void validateClusterStatsRouterBean(RouterMBean bean) { assertFalse(bean.isSecurityEnabled()); } - private void testCapacity(FederationMBean bean) { + private void testCapacity(FederationMBean bean) throws IOException { Review comment: Too many spaces. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585234) Time Spent: 1h 40m (was: 1.5h) > RBF: RBFMetrics's TotalCapacity out of bounds > - > > Key: HDFS-15810 > URL: https://issues.apache.org/jira/browse/HDFS-15810 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Xiaoxing Wei >Assignee: Fengnan Li >Priority: Major > Labels: pull-request-available > Attachments: image-2021-02-02-10-59-17-113.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The Long type fields TotalCapacity,UsedCapacity and RemainingCapacity in > RBFMetrics maybe ** out of bounds. > !image-2021-02-02-10-59-17-113.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
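The overflow the issue describes — summing per-subcluster `long` capacities past `Long.MAX_VALUE` — can be sketched as follows. This is a minimal illustration of the overflow-safe accumulation idea, not the code from PR #2910; `CapacityTotals` is a hypothetical name.

```java
import java.math.BigInteger;

// Summing capacities as long silently wraps once the federation-wide
// total exceeds Long.MAX_VALUE (about 9.2 EB); accumulating in
// BigInteger preserves the exact total for reporting.
class CapacityTotals {
    static BigInteger total(long[] perSubcluster) {
        BigInteger sum = BigInteger.ZERO;
        for (long capacity : perSubcluster) {
            sum = sum.add(BigInteger.valueOf(capacity));
        }
        return sum;
    }
}
```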
[jira] [Updated] (HDFS-15878) RBF: Flaky test TestRouterWebHDFSContractCreate>AbstractContractCreateTest#testSyncable in Trunk
[ https://issues.apache.org/jira/browse/HDFS-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-15878: --- Summary: RBF: Flaky test TestRouterWebHDFSContractCreate>AbstractContractCreateTest#testSyncable in Trunk (was: Flaky test TestRouterWebHDFSContractCreate>AbstractContractCreateTest#testSyncable in Trunk) > RBF: Flaky test > TestRouterWebHDFSContractCreate>AbstractContractCreateTest#testSyncable in > Trunk > > > Key: HDFS-15878 > URL: https://issues.apache.org/jira/browse/HDFS-15878 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs, rbf >Reporter: Renukaprasad C >Assignee: Fengnan Li >Priority: Major > > ERROR] Tests run: 16, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: > 24.627 s <<< FAILURE! - in > org.apache.hadoop.fs.contract.router.web.TestRouterWebHDFSContractCreate > [ERROR] > testSyncable(org.apache.hadoop.fs.contract.router.web.TestRouterWebHDFSContractCreate) > Time elapsed: 0.222 s <<< ERROR! > java.io.FileNotFoundException: File /test/testSyncable not found. 
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:576) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$900(WebHdfsFileSystem.java:146) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:892) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:858) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:652) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:690) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:686) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$ReadRunner.getRedirectedUrl(WebHdfsFileSystem.java:2307) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$ReadRunner.(WebHdfsFileSystem.java:2296) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem$WebHdfsInputStream.(WebHdfsFileSystem.java:2176) > at > org.apache.hadoop.hdfs.web.WebHdfsFileSystem.open(WebHdfsFileSystem.java:1610) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:975) > at > 
org.apache.hadoop.fs.contract.AbstractContractCreateTest.validateSyncableSemantics(AbstractContractCreateTest.java:556) > at > org.apache.hadoop.fs.contract.AbstractContractCreateTest.testSyncable(AbstractContractCreateTest.java:459) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > Caused by: >
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585222=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585222 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 16:09 Start Date: 19/Apr/21 16:09 Worklog Time Spent: 10m Work Description: functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822591028 > In that case, please change the title of the Jira and the description to remove references to "hanging" problems. @amahussein I still would like to argue about this "hanging" issue. There has been reported TCP network I/O issues which hangs for >15min without throwing any exception. [ZOOKEEPER-2201](https://issues.apache.org/jira/browse/ZOOKEEPER-2201) is a perfect example, and you can find the TCP level explanation for this hanging issue in https://www.usenix.org/conference/srecon16/program/presentation/nadolny Similar hanging bugs are also accepted by ZooKeeper community, such as: - [ZOOKEEPER-3531](https://issues.apache.org/jira/browse/ZOOKEEPER-3531): very similar to ZK-2201; the patch is merged - [ZOOKEEPER-4074](https://issues.apache.org/jira/browse/ZOOKEEPER-4074): a similar network hanging bug I reported; already confirmed by community; more discussion can be found in https://github.com/apache/zookeeper/pull/1582 However, in our scenario ([HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869)), a possible counterargument is: the `call.sendResponse()` invocation eventually invokes `channel.write(buffer)` (line 3611), which is non-blocking mode, so it might not be affected by this potential issue. 
https://github.com/apache/hadoop/blob/3c57512d104e3a92391c9a03ce4005a00267c07f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3607-L3616 However, as we point out in [HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869), when the payload is huge, in line 3611, it won't invoke `channel.write(buffer)`; instead, it invokes `channelIO(null, channel, buffer)` which brings us to: https://github.com/apache/hadoop/blob/3c57512d104e3a92391c9a03ce4005a00267c07f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3646-L3672 If the payload is split in two batches, the second batch will have to wait for the first batch to be sent out, which may encounter high packet loss rate and thus slow I/O. Hence, I would say the hanging problem still exists. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585222) Time Spent: 4.5h (was: 4h 20m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 4.5h > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). 
> The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. > RuntimeException syncEx = null; > try { >
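The `channelIO` argument above — that a large payload written in batches can stall when the peer stops reading — can be modeled with a small drain loop. This is a simplified sketch of the failure mode, not Hadoop's actual `Server.channelIO`; `BoundedDrain` and `WritableChannel` are stand-ins (the latter for `java.nio.channels.SocketChannel`).

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Minimal model of a channelIO-style drain loop: in non-blocking mode a
// write may move zero bytes, so an unbounded loop makes progress only as
// fast as the peer reads. Bounding the attempts turns a silent stall
// into a visible error.
class BoundedDrain {
    interface WritableChannel {
        int write(ByteBuffer buf) throws IOException;
    }

    // Returns how many write() calls were needed to drain the buffer.
    static int drain(WritableChannel ch, ByteBuffer buf, int maxAttempts)
            throws IOException {
        int attempts = 0;
        while (buf.hasRemaining()) {
            if (attempts++ >= maxAttempts) {
                throw new IOException(
                    "write stalled after " + maxAttempts + " attempts");
            }
            ch.write(buf); // may return 0 without writing anything
        }
        return attempts;
    }
}
```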
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585193=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585193 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 15:34 Start Date: 19/Apr/21 15:34 Worklog Time Spent: 10m Work Description: functioner commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822564254 > If any concerns about lambda expression, we could improve it rather than reject it directly. @amahussein A common way to eliminate such overhead is preparing multiple consumer threads, and feed them with requests. If the lambda expressions cause significant overhead, we can improve in that way. This design pattern is widely used in Cassandra. Example: SEPWorker - SEPExecutor https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPWorker.java https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585193) Time Spent: 4h 20m (was: 4h 10m) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 4h 20m > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found some network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). 
> The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. > RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); // > line 243 > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx);// > line 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > {code} > In terms of the step 4, FSEditLogAsync$RpcEdit.logSyncNotify is > essentially doing some network write (line 365). > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { > call.sendResponse(); // line > 365 > } else { > call.abortResponse(syncEx); > } > } catch (Exception e) {} // don't care if not sent. > } > // ... > }{code} > If the sendResponse operation in line 365 gets stuck, then the whole > FSEditLogAsync thread is not able to proceed. 
In this case, the critical > logSync (line 243) can’t be executed for the incoming transactions, and the > namenode hangs. This is undesirable
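The alternative the reviewer describes — "runnables waiting for tasks to be added to a queue", as in Cassandra's SEPWorker/SEPExecutor — can be sketched as below. This is an illustrative minimal version of the pattern, not code from either project; `WorkerPool` and `Task` are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Pre-started worker threads take tasks from a shared queue. Workers are
// created once at startup, so the hot submission path only enqueues: no
// lambda or other per-request wrapper is allocated at call time.
class WorkerPool {
    interface Task {
        void run();
    }

    private final BlockingQueue<Task> queue = new LinkedBlockingQueue<>();

    WorkerPool(int workers) {
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        queue.take().run(); // block until work arrives
                    } catch (InterruptedException ie) {
                        return;
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }
    }

    void submit(Task task) {
        queue.add(task); // producer side: enqueue only
    }
}
```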
[jira] [Work logged] (HDFS-15991) Add location into datanode info for NameNodeMXBean
[ https://issues.apache.org/jira/browse/HDFS-15991?focusedWorklogId=585186=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585186 ] ASF GitHub Bot logged work on HDFS-15991: - Author: ASF GitHub Bot Created on: 19/Apr/21 15:04 Start Date: 19/Apr/21 15:04 Worklog Time Spent: 10m Work Description: tomscut opened a new pull request #2933: URL: https://github.com/apache/hadoop/pull/2933 JIRA: [HDFS-15991](https://issues.apache.org/jira/browse/HDFS-15991) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585186) Remaining Estimate: 0h Time Spent: 10m > Add location into datanode info for NameNodeMXBean > -- > > Key: HDFS-15991 > URL: https://issues.apache.org/jira/browse/HDFS-15991 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Add location into datanode info for NameNodeMXBean. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15991) Add location into datanode info for NameNodeMXBean
[ https://issues.apache.org/jira/browse/HDFS-15991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-15991: -- Labels: pull-request-available (was: ) > Add location into datanode info for NameNodeMXBean > -- > > Key: HDFS-15991 > URL: https://issues.apache.org/jira/browse/HDFS-15991 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Add location into datanode info for NameNodeMXBean. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15991) Add location into datanode info for NameNodeMXBean
tomscut created HDFS-15991: -- Summary: Add location into datanode info for NameNodeMXBean Key: HDFS-15991 URL: https://issues.apache.org/jira/browse/HDFS-15991 Project: Hadoop HDFS Issue Type: Wish Reporter: tomscut Assignee: tomscut Add location into datanode info for NameNodeMXBean. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15979) Move within EZ fails and cannot remove nested EZs
[ https://issues.apache.org/jira/browse/HDFS-15979?focusedWorklogId=585177=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585177 ] ASF GitHub Bot logged work on HDFS-15979: - Author: ASF GitHub Bot Created on: 19/Apr/21 14:46 Start Date: 19/Apr/21 14:46 Worklog Time Spent: 10m Work Description: amahussein commented on pull request #2919: URL: https://github.com/apache/hadoop/pull/2919#issuecomment-822525000 The failures are not related. Those were the intermittent failures reported in the daily reports and existing jiras are addressing them. > The changes were contributed by Daryn Sharp and we have our internal clusters running on those changes with hadoop-2.8 and hadoop-2.10. @jojochuang Do you have any feedback on those changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585177) Time Spent: 0.5h (was: 20m) > Move within EZ fails and cannot remove nested EZs > - > > Key: HDFS-15979 > URL: https://issues.apache.org/jira/browse/HDFS-15979 > Project: Hadoop HDFS > Issue Type: Bug > Components: encryption, hdfs >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Labels: pull-request-available > Attachments: HDFS-15979.001.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Moving between EZ directories should work fine if the EZ key for the > directories is identical. If the key is name identical then no > decrypt/re-encrypt is necessary. > However, the rename operation checks more than the key name. It compares the > inode number (unique identifier) of the source and dest dirs which will never > be the same for 2 dirs resulting in the cited failure. Note it also > incorrectly compares the key version. > A related issue is if an ancestor of a EZ share the same key (ie. 
> /projects/foo and /projects/foo/bar/blah both use same key), files also > cannot be moved from the child to a parent dir, plus the child EZ cannot be > removed even though it's now covered by the ancestor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
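The comparison the description argues about can be made concrete: two distinct encryption zones can never share an inode number, so an identity check rejects every cross-zone move, while comparing the key name admits moves that need no re-encryption. The sketch below illustrates only that distinction; it is not the HDFS-15979 patch, and `EzMoveCheck`/`Zone` are hypothetical names.

```java
// Two checks for whether a rename between encryption zones may skip
// decrypt/re-encrypt.
class EzMoveCheck {
    static final class Zone {
        final long inodeId;     // unique per directory, never equal across zones
        final String keyName;   // the property that actually matters
        Zone(long inodeId, String keyName) {
            this.inodeId = inodeId;
            this.keyName = keyName;
        }
    }

    // Over-strict check resembling the behavior the issue describes:
    // distinct zones always differ by inode id, so this always fails.
    static boolean sameZone(Zone src, Zone dst) {
        return src.inodeId == dst.inodeId;
    }

    // Key-name equality is what makes re-encryption unnecessary.
    static boolean sameKeyName(Zone src, Zone dst) {
        return src.keyName.equals(dst.keyName);
    }
}
```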
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=585174=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585174 ] ASF GitHub Bot logged work on HDFS-15989: - Author: ASF GitHub Bot Created on: 19/Apr/21 14:36 Start Date: 19/Apr/21 14:36 Worklog Time Spent: 10m Work Description: tasanuma commented on pull request #2923: URL: https://github.com/apache/hadoop/pull/2923#issuecomment-822517515 Thanks for updating it. +1, pending Jenkins. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585174) Time Spent: 3h 20m (was: 3h 10m) > Split TestBalancer into two classes > --- > > Key: HDFS-15989 > URL: https://issues.apache.org/jira/browse/HDFS-15989 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > TestBalancer has many tests accumulated, it would be good to split it up into > two classes. Moreover, TestBalancer#testMaxIterationTime is flaky. We should > also resolve it with this Jira. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15970) Print network topology on the web
[ https://issues.apache.org/jira/browse/HDFS-15970?focusedWorklogId=585172=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585172 ] ASF GitHub Bot logged work on HDFS-15970: - Author: ASF GitHub Bot Created on: 19/Apr/21 14:35 Start Date: 19/Apr/21 14:35 Worklog Time Spent: 10m Work Description: tomscut commented on pull request #2896: URL: https://github.com/apache/hadoop/pull/2896#issuecomment-822516400 Thanks a lot. @tasanuma @goiri @ayushtkn -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585172) Time Spent: 3h 20m (was: 3h 10m) > Print network topology on the web > - > > Key: HDFS-15970 > URL: https://issues.apache.org/jira/browse/HDFS-15970 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0 > > Attachments: hdfs-topology-json.jpg, hdfs-topology.jpg, hdfs-web.jpg > > Time Spent: 3h 20m > Remaining Estimate: 0h > > In order to query the network topology information conveniently, we can print > it on the web. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15869) Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang
[ https://issues.apache.org/jira/browse/HDFS-15869?focusedWorklogId=585171=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585171 ] ASF GitHub Bot logged work on HDFS-15869: - Author: ASF GitHub Bot Created on: 19/Apr/21 14:27 Start Date: 19/Apr/21 14:27 Worklog Time Spent: 10m Work Description: amahussein commented on pull request #2737: URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822510445 > @amahussein Thanks for your quick response. I think this is not the same concept/issue between [HDFS-15957](https://issues.apache.org/jira/browse/HDFS-15957) and [HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869). I have left a comment at [HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869) and suggested changing it to `improvement` rather than `bug fix`; (cc @functioner) > IMO, this is a classic Producer-Consumer problem, and it is a natural idea to improve performance in a parallel way. And Yiqun has reported the same issue at [HDFS-15486](https://issues.apache.org/jira/browse/HDFS-15486). In my own production env it gave about a 5% E2E time saving for write operations. > My suggestions: > A. Update the description for an improvement rather than a bug fix. > B. If there are any concerns about the lambda expression, we could improve it rather than reject it directly. > Welcome any more discussion. Thanks everyone here. Thanks @Hexiaoqiao for the comment. OK, is the purpose of the change to improve the performance of `FSEditLogAsync.java` by executing `sendResponse()` in parallel? In that case, please change the title of the Jira and the description to remove references to "hanging" problems. Then I will take another look. I am sorry for the inconvenience, but I want to make sure I understand the purpose of the change before reviewing. > @amahussein Thanks for the comment. > Can I send an email to you to explain more about the issue? 
@Hexiaoqiao and I have had some more discussion on it, and some of it is inconvenient to put in here. You can contact me via [oier...@gmail.com](mailto:oier...@gmail.com) or [ha...@jhu.edu](mailto:ha...@jhu.edu) and then I will reply. Thanks @functioner! I really appreciate that. I think @Hexiaoqiao's reply has already clarified some of the confusion about the scope of the work. Please feel free to reach me by email at any time. I am on the common-dev mailing list. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585171) Time Spent: 4h 10m (was: 4h) > Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can > cause the namenode to hang > > > Key: HDFS-15869 > URL: https://issues.apache.org/jira/browse/HDFS-15869 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Assignee: Haoze Wu >Priority: Major > Labels: pull-request-available > Time Spent: 4h 10m > Remaining Estimate: 0h > > We were doing some testing of the latest Hadoop stable release 3.2.2 and > found that a network issue can cause the namenode to hang even with the async > edit logging (FSEditLogAsync). > The workflow of the FSEditLogAsync thread is basically: > # get EditLog from a queue (line 229) > # do the transaction (line 232) > # sync the log if doSync (line 243) > # do logSyncNotify (line 248) > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); // > line 229 > if (edit != null) { > // sync if requested by edit log. 
> doSync = edit.logEdit(); // > line 232 > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. > RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); // > line 243 >
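The discussion above frames step 4 of the loop as a Producer-Consumer bottleneck: `logSyncNotify()` calls `sendResponse()` inline, so one slow client socket can stall the single sync thread. A minimal, hypothetical sketch of the "parallel notify" idea under discussion — the class, field, and pool size below are illustrative, not FSEditLogAsync's actual code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncNotifySketch {
    // Hypothetical stand-in for FSEditLogAsync; pool size is illustrative.
    private static final ExecutorService NOTIFIER = Executors.newFixedThreadPool(4);

    // Step 4, offloaded: a slow sendResponse() now blocks only a pool
    // thread, not the thread draining the edit queue in run().
    static void notifyAsync(Runnable sendResponse) {
        NOTIFIER.submit(sendResponse);
    }

    // Demonstrates that the hand-off completes off the caller's thread.
    static boolean notifyCompletes() {
        CountDownLatch done = new CountDownLatch(1);
        notifyAsync(done::countDown);
        try {
            return done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(notifyCompletes()); // true
        NOTIFIER.shutdown();
    }
}
```

Whether this is an improvement or a bug fix is exactly the scope question being debated: the sketch changes throughput behavior, not correctness.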
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325077#comment-17325077 ] Hadoop QA commented on HDFS-15973: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 43s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 47s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 3s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 16m 49s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 1m 10s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 40s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green}{color} | {color:green} the patch passed with
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=585164=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585164 ] ASF GitHub Bot logged work on HDFS-15989: - Author: ASF GitHub Bot Created on: 19/Apr/21 14:20 Start Date: 19/Apr/21 14:20 Worklog Time Spent: 10m Work Description: virajjasani commented on a change in pull request #2923: URL: https://github.com/apache/hadoop/pull/2923#discussion_r615884691 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer2.java ## @@ -0,0 +1,744 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hdfs.server.balancer; + +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.StorageType; +import org.apache.hadoop.fs.permission.FsPermission; +import org.apache.hadoop.hdfs.DFSClient; +import org.apache.hadoop.hdfs.DFSConfigKeys; +import org.apache.hadoop.hdfs.DFSTestUtil; +import org.apache.hadoop.hdfs.DFSUtil; +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.HdfsConfiguration; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hdfs.NameNodeProxies; +import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys; +import org.apache.hadoop.hdfs.protocol.ClientProtocol; +import org.apache.hadoop.hdfs.protocol.DatanodeID; +import org.apache.hadoop.hdfs.protocol.DatanodeInfo; +import org.apache.hadoop.hdfs.protocol.HdfsConstants; +import org.apache.hadoop.hdfs.protocol.LocatedBlock; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementStatus; +import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager; +import org.apache.hadoop.hdfs.server.datanode.DataNode; +import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils; +import org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset; +import org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.test.GenericTestUtils; +import org.junit.After; +import org.junit.Test; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.slf4j.event.Level; + +import java.io.OutputStream; +import java.net.InetSocketAddress; +import java.net.URI; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import 
java.util.HashSet; +import java.util.List; +import java.util.Set; + +import static org.apache.hadoop.fs.StorageType.DEFAULT; +import static org.apache.hadoop.fs.StorageType.RAM_DISK; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BLOCK_SIZE_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_BLOCK_PINNING_ENABLED; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_LAZY_WRITER_INTERVAL_SEC; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_MAX_LOCKED_MEMORY_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_LAZY_PERSIST_FILE_SCRUB_INTERVAL_SEC; +import static org.apache.hadoop.test.PlatformAssumptions.assumeNotWindows; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +/** + * Some long running Balancer tasks. + */ +public class TestBalancer2 { + + private static final Logger LOG = + LoggerFactory.getLogger(TestBalancer2.class); + + static { +GenericTestUtils.setLogLevel(Balancer.LOG, Level.TRACE); +GenericTestUtils.setLogLevel(Dispatcher.LOG, Level.DEBUG); + } + + private final static long CAPACITY = 5000L; + private final static String RACK0 = "/rack0"; +
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=585143=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585143 ] ASF GitHub Bot logged work on HDFS-15989: - Author: ASF GitHub Bot Created on: 19/Apr/21 14:04 Start Date: 19/Apr/21 14:04 Worklog Time Spent: 10m Work Description: tasanuma commented on pull request #2923: URL: https://github.com/apache/hadoop/pull/2923#issuecomment-822493305 We can remove `DEFAULT_RAM_DISK_BLOCK_SIZE` in TestBalancer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585143) Time Spent: 3h (was: 2h 50m) > Split TestBalancer into two classes > --- > > Key: HDFS-15989 > URL: https://issues.apache.org/jira/browse/HDFS-15989 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > TestBalancer has accumulated many tests; it would be good to split it into > two classes. Moreover, TestBalancer#testMaxIterationTime is flaky. We should > also resolve that with this Jira. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=585142=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585142 ] ASF GitHub Bot logged work on HDFS-15989: - Author: ASF GitHub Bot Created on: 19/Apr/21 14:03 Start Date: 19/Apr/21 14:03 Worklog Time Spent: 10m Work Description: tasanuma commented on a change in pull request #2923: URL: https://github.com/apache/hadoop/pull/2923#discussion_r615876497 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer2.java ## @@ -0,0 +1,744 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hdfs.server.balancer; + +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.StorageType; +import org.apache.hadoop.fs.permission.FsPermission; +import org.apache.hadoop.hdfs.DFSClient; +import org.apache.hadoop.hdfs.DFSConfigKeys; +import org.apache.hadoop.hdfs.DFSTestUtil; +import org.apache.hadoop.hdfs.DFSUtil; +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.HdfsConfiguration; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hdfs.NameNodeProxies; +import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys; +import org.apache.hadoop.hdfs.protocol.ClientProtocol; +import org.apache.hadoop.hdfs.protocol.DatanodeID; +import org.apache.hadoop.hdfs.protocol.DatanodeInfo; +import org.apache.hadoop.hdfs.protocol.HdfsConstants; +import org.apache.hadoop.hdfs.protocol.LocatedBlock; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain; +import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementStatus; +import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager; +import org.apache.hadoop.hdfs.server.datanode.DataNode; +import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils; +import org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset; +import org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.test.GenericTestUtils; +import org.junit.After; +import org.junit.Test; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import org.slf4j.event.Level; + +import java.io.OutputStream; +import java.net.InetSocketAddress; +import java.net.URI; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import 
java.util.HashSet; +import java.util.List; +import java.util.Set; + +import static org.apache.hadoop.fs.StorageType.DEFAULT; +import static org.apache.hadoop.fs.StorageType.RAM_DISK; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BLOCK_SIZE_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_BLOCK_PINNING_ENABLED; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_LAZY_WRITER_INTERVAL_SEC; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_MAX_LOCKED_MEMORY_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY; +import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_LAZY_PERSIST_FILE_SCRUB_INTERVAL_SEC; +import static org.apache.hadoop.test.PlatformAssumptions.assumeNotWindows; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +/** + * Some long running Balancer tasks. + */ +public class TestBalancer2 { + + private static final Logger LOG = + LoggerFactory.getLogger(TestBalancer2.class); + + static { +GenericTestUtils.setLogLevel(Balancer.LOG, Level.TRACE); +GenericTestUtils.setLogLevel(Dispatcher.LOG, Level.DEBUG); + } + + private final static long CAPACITY = 5000L; + private final static String RACK0 = "/rack0"; + private
[jira] [Commented] (HDFS-15990) No need to write to sub_section when serialize SnapshotDiff Section
[ https://issues.apache.org/jira/browse/HDFS-15990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325056#comment-17325056 ] Stephen O'Donnell commented on HDFS-15990: -- The reason the sub-sections are written here is so we can later add the ability to load them in parallel (as with the inode and directory sections). However, so far nobody has started that work. > No need to write to sub_section when serialize SnapshotDiff Section > --- > > Key: HDFS-15990 > URL: https://issues.apache.org/jira/browse/HDFS-15990 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.10.0 >Reporter: Xuze Yang >Priority: Minor > > In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code > exists: > {code:java} > if (i % parent.getInodesPerSubSection() == 0) { > parent.commitSubSection(headers, > FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB); > }{code} > It aims to serialize SnapshotDiff information into several sub_sections (i.e. > additional sub_section information will be written to the FileSummary section). > But in FSImageFormatPBSnapshot.loadSnapshotDiffSection(), it treats > SnapshotDiffSection as a whole, rather than several sub_sections. So there's no > need to introduce sub_sections here. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
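The comment above explains the intent behind the `i % parent.getInodesPerSubSection() == 0` check quoted in the description: every N entries, a sub-section boundary is recorded in the FileSummary so a future loader could seek to each boundary and parse chunks in parallel. A standalone illustration of that batching pattern (names are illustrative, not the FSImageFormatPBSnapshot API):

```java
import java.util.ArrayList;
import java.util.List;

public class SubSectionSketch {
    // Mirrors the modulo check in serializeSnapshotDiffSection(): entries
    // 0, N, 2N, ... start a new sub-section. In the real writer this is
    // where commitSubSection(...) records the boundary in FileSummary.
    static List<Integer> subSectionStarts(int entryCount, int entriesPerSubSection) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < entryCount; i++) {
            if (i % entriesPerSubSection == 0) {
                starts.add(i); // boundary: a parallel loader could seek here
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // 10 snapshot-diff entries, 4 per sub-section -> 3 sub-sections
        System.out.println(subSectionStarts(10, 4)); // [0, 4, 8]
    }
}
```

So the boundaries are harmless to a loader that reads the section as a whole, which is why the issue is a question of dead weight rather than corruption.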
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325054#comment-17325054 ] Wei-Chiu Chuang commented on HDFS-15796: [~Daniel Ma] ping. Please let us know more details. Meanwhile I updated target version to 3.4.0. > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
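The stack trace above is ArrayList's fail-fast iterator firing inside `computeReconstructionWorkForBlocks`: the list is structurally modified while being iterated. A minimal standalone reproduction of the same failure mode (not the BlockManager code itself, where the modification comes from a concurrent thread rather than the iterating one):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // ArrayList's iterator snapshots modCount; any structural change made
    // outside the iterator invalidates it, and the next() call throws.
    static boolean removalDuringIterationThrows() {
        List<Integer> blocks = new ArrayList<>(Arrays.asList(1, 2, 3));
        try {
            for (Integer b : blocks) {
                blocks.remove(b); // bumps modCount; the next next() fails the check
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(removalDuringIterationThrows()); // true
    }
}
```

In the NameNode case the fix direction would be synchronizing the writer with the RedundancyMonitor's iteration, or iterating over a copy; which applies depends on the details the reporter is being asked for.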
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15796: --- Target Version/s: 3.4.0 (was: 3.3.1) > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15796: --- Fix Version/s: (was: 3.1.1) > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325026#comment-17325026 ] Jinglun commented on HDFS-15973: Hi [~zhengzhuobinzzb], thanks for your comments! Security mode was not considered in v03; thanks for your explanation! Submitted v04. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.004.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15990) No need to write to sub_section when serialize SnapshotDiff Section
[ https://issues.apache.org/jira/browse/HDFS-15990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuze Yang updated HDFS-15990: - Priority: Minor (was: Major) > No need to write to sub_section when serialize SnapshotDiff Section > --- > > Key: HDFS-15990 > URL: https://issues.apache.org/jira/browse/HDFS-15990 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.10.0 >Reporter: Xuze Yang >Priority: Minor > > In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code > exists: > {code:java} > if (i % parent.getInodesPerSubSection() == 0) { > parent.commitSubSection(headers, > FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB); > }{code} > It aims to serialize SnapshotDiff information into several sub_sections(i.e. > additional sub_sections information will be written to FileSummary Section). > But in FSImageFormatPBSnapshot.loadSnapshotDiffSection(), it treats > SnapshotDiffSection as a whole, rather than several sub_sections. So it's no > need to introduce sub_sections here. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15990) No need to write to sub_section when serialize SnapshotDiff Section
[ https://issues.apache.org/jira/browse/HDFS-15990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuze Yang updated HDFS-15990: - Description: In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code exists: {code:java} if (i % parent.getInodesPerSubSection() == 0) { parent.commitSubSection(headers, FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB); }{code} It aims to serialize SnapshotDiff information into several sub_sections (i.e. additional sub_section information will be written to the FileSummary section). But in FSImageFormatPBSnapshot.loadSnapshotDiffSection(), it treats SnapshotDiffSection as a whole, rather than as several sub_sections. So there is no need to introduce sub_sections here. was: In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code exists: {quote}if (i % parent.getInodesPerSubSection() == 0) { parent.commitSubSection(headers, FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB); }{quote} It aims to serialize SnapshotDiff information into several sub_sections (i.e. additional sub_section information will be written to the FileSummary section). But in FSImageFormatPBSnapshot.loadSnapshotDiffSection(), it treats SnapshotDiffSection as a whole, rather than as several sub_sections. So there is no need to introduce sub_sections here. > No need to write to sub_section when serialize SnapshotDiff Section > --- > > Key: HDFS-15990 > URL: https://issues.apache.org/jira/browse/HDFS-15990 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.10.0 >Reporter: Xuze Yang >Priority: Major > > In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code > exists: > {code:java} > if (i % parent.getInodesPerSubSection() == 0) { > parent.commitSubSection(headers, > FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB); > }{code} > It aims to serialize SnapshotDiff information into several sub_sections (i.e.
> additional sub_section information will be written to the FileSummary section). > But in FSImageFormatPBSnapshot.loadSnapshotDiffSection(), it treats > SnapshotDiffSection as a whole, rather than as several sub_sections. So there > is no need to introduce sub_sections here. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15990) No need to write to sub_section when serialize SnapshotDiff Section
Xuze Yang created HDFS-15990: Summary: No need to write to sub_section when serialize SnapshotDiff Section Key: HDFS-15990 URL: https://issues.apache.org/jira/browse/HDFS-15990 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.10.0 Reporter: Xuze Yang In FSImageFormatPBSnapshot.serializeSnapshotDiffSection(), the following code exists: {quote}if (i % parent.getInodesPerSubSection() == 0) { parent.commitSubSection(headers, FSImageFormatProtobuf.SectionName.SNAPSHOT_DIFF_SUB); } {quote} It aims to serialize SnapshotDiff information into several sub_sections (i.e. additional sub_section information will be written to the FileSummary section). But in FSImageFormatPBSnapshot.loadSnapshotDiffSection(), it treats SnapshotDiffSection as a whole, rather than as several sub_sections. So there is no need to introduce sub_sections here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
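[Editor's note] The modulo-based commit pattern quoted above can be sketched stand-alone. The following is a hypothetical, simplified Java model (the class SubSectionSketch and method subSectionBoundaries are illustrative names, not Hadoop code): it records the positions where a serializer like serializeSnapshotDiffSection() would call parent.commitSubSection(...), which makes the issue's point visible — a loader that reads the section as one stream never consumes those recorded boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the sub-section commit pattern, not Hadoop code.
public class SubSectionSketch {

    /**
     * Returns the entry indices at which a sub-section boundary would be
     * committed, i.e. where "i % inodesPerSubSection == 0" holds in the
     * serializer loop. Each returned index corresponds to one extra
     * sub-section record written to the FileSummary.
     */
    static List<Integer> subSectionBoundaries(int diffCount, int inodesPerSubSection) {
        List<Integer> boundaries = new ArrayList<>();
        for (int i = 1; i <= diffCount; i++) {  // i counts serialized diff entries
            if (i % inodesPerSubSection == 0) {
                boundaries.add(i);              // stands in for parent.commitSubSection(...)
            }
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // 10 diff entries with a boundary every 4 entries:
        // boundaries are recorded after entries 4 and 8, but a whole-section
        // loader reads all 10 entries sequentially and ignores them.
        System.out.println(subSectionBoundaries(10, 4)); // [4, 8]
    }
}
```

Sub-sections only pay off when the loader can hand each sub-section to a separate thread (as is done for the INode sections); if the load side is strictly sequential, the recorded boundaries are dead weight, which is the issue's argument.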
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: (was: HDFS-15973.004.patch) > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.004.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15970) Print network topology on the web
[ https://issues.apache.org/jira/browse/HDFS-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takanobu Asanuma updated HDFS-15970: Fix Version/s: 3.3.1 > Print network topology on the web > - > > Key: HDFS-15970 > URL: https://issues.apache.org/jira/browse/HDFS-15970 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0 > > Attachments: hdfs-topology-json.jpg, hdfs-topology.jpg, hdfs-web.jpg > > Time Spent: 3h 10m > Remaining Estimate: 0h > > In order to query the network topology information conveniently, we can print > it on the web. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15811) completeFile should log final file size
[ https://issues.apache.org/jira/browse/HDFS-15811?focusedWorklogId=585084=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585084 ] ASF GitHub Bot logged work on HDFS-15811: - Author: ASF GitHub Bot Created on: 19/Apr/21 11:47 Start Date: 19/Apr/21 11:47 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2670: URL: https://github.com/apache/hadoop/pull/2670#issuecomment-822404635 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 11s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 2s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 39m 8s | | trunk passed | | +1 :green_heart: | compile | 1m 46s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 17s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 47s | | trunk passed | | +1 :green_heart: | javadoc | 1m 5s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 45s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 4m 16s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 37s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 24s | | the patch passed | | +1 :green_heart: | compile | 1m 26s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 1m 25s | | the patch passed | | +1 :green_heart: | compile | 1m 23s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 1m 23s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 15s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 44s | | the patch passed | | +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 58s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 7s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 257m 16s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2670/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 366m 14s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks | | | hadoop.hdfs.server.namenode.TestFileTruncate | | | hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock | | | hadoop.hdfs.TestGetBlocks | | | hadoop.hdfs.server.diskbalancer.TestDiskBalancerRPC | | | hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap | | | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithOrderedSnapshotDeletion | | | hadoop.hdfs.TestClientReportBadBlock | | | hadoop.hdfs.server.namenode.snapshot.TestAclWithSnapshot | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.blockmanagement.TestErasureCodingCorruption | | | hadoop.hdfs.server.namenode.TestMetadataVersionOutput | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandbyWithQJM | | | hadoop.hdfs.server.blockmanagement.TestSlowDiskTracker | | |
[jira] [Resolved] (HDFS-15970) Print network topology on the web
[ https://issues.apache.org/jira/browse/HDFS-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takanobu Asanuma resolved HDFS-15970. - Fix Version/s: 3.4.0 Resolution: Fixed > Print network topology on the web > - > > Key: HDFS-15970 > URL: https://issues.apache.org/jira/browse/HDFS-15970 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: hdfs-topology-json.jpg, hdfs-topology.jpg, hdfs-web.jpg > > Time Spent: 3h 10m > Remaining Estimate: 0h > > In order to query the network topology information conveniently, we can print > it on the web. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15970) Print network topology on the web
[ https://issues.apache.org/jira/browse/HDFS-15970?focusedWorklogId=585083=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585083 ] ASF GitHub Bot logged work on HDFS-15970: - Author: ASF GitHub Bot Created on: 19/Apr/21 11:43 Start Date: 19/Apr/21 11:43 Worklog Time Spent: 10m Work Description: tasanuma commented on pull request #2896: URL: https://github.com/apache/hadoop/pull/2896#issuecomment-822402613 Merged to trunk. Thanks for your contribution, @tomscut. Thanks for your review and your comments, @goiri and @ayushtkn. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585083) Time Spent: 3h 10m (was: 3h) > Print network topology on the web > - > > Key: HDFS-15970 > URL: https://issues.apache.org/jira/browse/HDFS-15970 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Attachments: hdfs-topology-json.jpg, hdfs-topology.jpg, hdfs-web.jpg > > Time Spent: 3h 10m > Remaining Estimate: 0h > > In order to query the network topology information conveniently, we can print > it on the web. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15970) Print network topology on the web
[ https://issues.apache.org/jira/browse/HDFS-15970?focusedWorklogId=585082=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585082 ] ASF GitHub Bot logged work on HDFS-15970: - Author: ASF GitHub Bot Created on: 19/Apr/21 11:42 Start Date: 19/Apr/21 11:42 Worklog Time Spent: 10m Work Description: tasanuma merged pull request #2896: URL: https://github.com/apache/hadoop/pull/2896 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 585082) Time Spent: 3h (was: 2h 50m) > Print network topology on the web > - > > Key: HDFS-15970 > URL: https://issues.apache.org/jira/browse/HDFS-15970 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: tomscut >Assignee: tomscut >Priority: Minor > Labels: pull-request-available > Attachments: hdfs-topology-json.jpg, hdfs-topology.jpg, hdfs-web.jpg > > Time Spent: 3h > Remaining Estimate: 0h > > In order to query the network topology information conveniently, we can print > it on the web. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15879) Exclude slow nodes when choose targets for blocks
[ https://issues.apache.org/jira/browse/HDFS-15879?focusedWorklogId=585075=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585075 ] ASF GitHub Bot logged work on HDFS-15879: - Author: ASF GitHub Bot Created on: 19/Apr/21 10:53 Start Date: 19/Apr/21 10:53 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2928: URL: https://github.com/apache/hadoop/pull/2928#issuecomment-822375565 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 5s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ branch-3.3 Compile Tests _ | | +0 :ok: | mvndep | 15m 28s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 24m 3s | | branch-3.3 passed | | +1 :green_heart: | compile | 20m 49s | | branch-3.3 passed | | +1 :green_heart: | checkstyle | 3m 25s | | branch-3.3 passed | | +1 :green_heart: | mvnsite | 4m 42s | | branch-3.3 passed | | +1 :green_heart: | javadoc | 4m 26s | | branch-3.3 passed | | +1 :green_heart: | spotbugs | 10m 32s | | branch-3.3 passed | | +1 :green_heart: | shadedclient | 22m 47s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 31s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 3m 44s | | the patch passed | | +1 :green_heart: | compile | 20m 48s | | the patch passed | | -1 :x: | javac | 20m 48s | [/results-compile-javac-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/2/artifact/out/results-compile-javac-root.txt) | root generated 1 new + 1846 unchanged - 1 fixed = 1847 total (was 1847) | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 3m 17s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/2/artifact/out/results-checkstyle-root.txt) | root: The patch generated 5 new + 775 unchanged - 5 fixed = 780 total (was 780) | | +1 :green_heart: | mvnsite | 4m 46s | | the patch passed | | +1 :green_heart: | xml | 0m 1s | | The patch has no ill-formed XML file. | | +1 :green_heart: | javadoc | 4m 30s | | the patch passed | | +1 :green_heart: | spotbugs | 10m 10s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 14s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 19m 9s | | hadoop-common in the patch passed. | | +1 :green_heart: | unit | 2m 45s | | hadoop-hdfs-client in the patch passed. | | -1 :x: | unit | 249m 34s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. 
| | | | 446m 57s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys | | | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks | | | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.TestReconstructStripedFileWithValidator | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2928 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell xml | | uname | Linux 694ae6170bac 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | branch-3.3 / ffaf672b3f95b83013c1d941544f6886c625791f | | Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~18.04-b08 | | Test Results |
[jira] [Updated] (HDFS-15872) Add the failed reason to Metrics during choosing Datanode.
[ https://issues.apache.org/jira/browse/HDFS-15872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Yun updated HDFS-15872: Summary: Add the failed reason to Metrics during choosing Datanode. (was: Add the failed reason to Metrics duiring choosing Datanode.) > Add the failed reason to Metrics during choosing Datanode. > -- > > Key: HDFS-15872 > URL: https://issues.apache.org/jira/browse/HDFS-15872 > Project: Hadoop HDFS > Issue Type: Improvement > Components: block placement, namenode > Environment: Add the failed reason to Metrics during choosing > Datanode. >Reporter: Yang Yun >Assignee: Yang Yun >Priority: Minor > Attachments: HDFS-15872.001.patch > > > Add the failed reason to metrics during choosing Datanode. So we can > troubleshoot or add storage-related monitoring. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds
[ https://issues.apache.org/jira/browse/HDFS-15810?focusedWorklogId=585019=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585019 ] ASF GitHub Bot logged work on HDFS-15810: - Author: ASF GitHub Bot Created on: 19/Apr/21 10:13 Start Date: 19/Apr/21 10:13 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2910: URL: https://github.com/apache/hadoop/pull/2910#issuecomment-822352050 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 16m 5s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 25m 56s | | trunk passed | | +1 :green_heart: | compile | 28m 23s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 21m 0s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 4m 2s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 30s | | trunk passed | | +1 :green_heart: | javadoc | 1m 52s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 2m 51s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 4m 30s | | trunk passed | | +1 :green_heart: | shadedclient | 18m 14s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 24s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 1s | | the patch passed | | +1 :green_heart: | compile | 25m 13s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 25m 13s | | the patch passed | | +1 :green_heart: | compile | 20m 10s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 20m 10s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 4m 5s | | the patch passed | | +1 :green_heart: | mvnsite | 2m 17s | | the patch passed | | +1 :green_heart: | javadoc | 1m 42s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 2m 39s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 4m 26s | | the patch passed | | +1 :green_heart: | shadedclient | 17m 13s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 18m 14s | | hadoop-common in the patch passed. | | -1 :x: | unit | 27m 24s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2910/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. | | +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. 
| | | | 255m 32s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouterRpc | | | hadoop.hdfs.server.federation.router.TestRouterFederationRename | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2910/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2910 | | Optional Tests | dupname asflicense mvnsite codespell markdownlint compile javac javadoc mvninstall unit shadedclient spotbugs checkstyle | | uname | Linux c35012684956 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 64dbbf2d0c6bcbcb8a8714e5e53f3b136c758af5 | | Default Java | Private
[jira] [Commented] (HDFS-15982) Deleted data on the Web UI must be saved to the trash
[ https://issues.apache.org/jira/browse/HDFS-15982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324868#comment-17324868 ] Bhavik Patel commented on HDFS-15982: - Thank you [~vjasani] for working on it. The provided PR #2927 looks good to me. [~ste...@apache.org] [~weichiu] any thoughts? > Deleted data on the Web UI must be saved to the trash > -- > > Key: HDFS-15982 > URL: https://issues.apache.org/jira/browse/HDFS-15982 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs >Reporter: Bhavik Patel >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > If we delete data from the Web UI, it should first be moved to the > configured/default Trash directory and be removed after the trash interval; > currently, the data is removed from the system directly [this behavior should > be the same as the CLI command]. > > This can be helpful when the user accidentally deletes data from the Web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
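[Editor's note] The trash behavior requested in this issue can be sketched. On the CLI path, HDFS moves a deleted path into trash via org.apache.hadoop.fs.Trash.moveToAppropriateTrash(fs, path, conf) and purges a trash checkpoint once it is older than fs.trash.interval (in minutes); the sketch below shows only the expiry arithmetic, stand-alone, with illustrative names (TrashExpirySketch, isExpired) that are not Hadoop APIs.

```java
// Hypothetical sketch of the fs.trash.interval expiry check the issue expects
// the Web UI delete path to follow (illustrative names, not Hadoop code).
// A trash checkpoint is only purged once its age reaches the configured
// interval, leaving a window in which an accidental delete can be undone.
public class TrashExpirySketch {

    /** True when a checkpoint created at checkpointMillis should be purged. */
    static boolean isExpired(long checkpointMillis, long nowMillis,
                             long trashIntervalMinutes) {
        return nowMillis - checkpointMillis >= trashIntervalMinutes * 60_000L;
    }

    public static void main(String[] args) {
        // With fs.trash.interval = 60 minutes:
        long created = 0L;
        System.out.println(isExpired(created, 59 * 60_000L, 60)); // still recoverable
        System.out.println(isExpired(created, 61 * 60_000L, 60)); // eligible for purge
    }
}
```

A Web UI delete that routed through the same move-to-trash call as the CLI would inherit this recovery window for free, which is the parity the issue asks for.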
[jira] [Work logged] (HDFS-15879) Exclude slow nodes when choose targets for blocks
[ https://issues.apache.org/jira/browse/HDFS-15879?focusedWorklogId=584963=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-584963 ]

ASF GitHub Bot logged work on HDFS-15879:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Apr/21 08:32
            Start Date: 19/Apr/21 08:32
    Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #2928:
URL: https://github.com/apache/hadoop/pull/2928#issuecomment-822281671

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 32m 41s | | Docker mode activated. |
|||| _ Prechecks _ ||
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ branch-3.3 Compile Tests _ ||
| +1 :green_heart: | mvninstall | 35m 50s | | branch-3.3 passed |
| +1 :green_heart: | compile | 1m 14s | | branch-3.3 passed |
| +1 :green_heart: | checkstyle | 0m 56s | | branch-3.3 passed |
| +1 :green_heart: | mvnsite | 1m 27s | | branch-3.3 passed |
| +1 :green_heart: | javadoc | 1m 35s | | branch-3.3 passed |
| +1 :green_heart: | spotbugs | 3m 22s | | branch-3.3 passed |
| +1 :green_heart: | shadedclient | 20m 32s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ ||
| +1 :green_heart: | mvninstall | 1m 26s | | the patch passed |
| +1 :green_heart: | compile | 1m 17s | | the patch passed |
| +1 :green_heart: | javac | 1m 17s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 52s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 19s | | the patch passed |
| +1 :green_heart: | xml | 0m 2s | | The patch has no ill-formed XML file. |
| +1 :green_heart: | javadoc | 1m 22s | | the patch passed |
| +1 :green_heart: | spotbugs | 3m 27s | | the patch passed |
| +1 :green_heart: | shadedclient | 19m 47s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ ||
| -1 :x: | unit | 250m 36s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 47s | | The patch does not generate ASF License warnings. |
| | | 375m 37s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
| | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
| | hadoop.hdfs.TestStripedFileAppend |
| | hadoop.hdfs.TestReconstructStripedFile |
| | hadoop.hdfs.TestReconstructStripedFileWithValidator |
| | hadoop.hdfs.server.namenode.ha.TestHAAppend |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/2928 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell xml |
| uname | Linux fd8cb4bfaac8 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / ba44159ab2f3ce5ae78ec94d3e6a754b010620e7 |
| Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~18.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/1/testReport/ |
| Max. process+thread count | 2824 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2928/1/console |
| versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |

This message was automatically generated.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=584921=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-584921 ]

ASF GitHub Bot logged work on HDFS-15989:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Apr/21 06:58
            Start Date: 19/Apr/21 06:58
    Worklog Time Spent: 10m

Work Description: virajjasani commented on a change in pull request #2923:
URL: https://github.com/apache/hadoop/pull/2923#discussion_r615585428

## File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer2.java

## @@ -0,0 +1,744 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.balancer;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.StorageType;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.hdfs.DFSClient;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.hadoop.hdfs.DFSTestUtil;
+import org.apache.hadoop.hdfs.DFSUtil;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.HdfsConfiguration;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.hadoop.hdfs.NameNodeProxies;
+import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys;
+import org.apache.hadoop.hdfs.protocol.ClientProtocol;
+import org.apache.hadoop.hdfs.protocol.DatanodeID;
+import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
+import org.apache.hadoop.hdfs.protocol.HdfsConstants;
+import org.apache.hadoop.hdfs.protocol.LocatedBlock;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain;
+import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementStatus;
+import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager;
+import org.apache.hadoop.hdfs.server.datanode.DataNode;
+import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils;
+import org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset;
+import org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.test.GenericTestUtils;
+import org.junit.After;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.slf4j.event.Level;
+
+import java.io.OutputStream;
+import java.net.InetSocketAddress;
+import java.net.URI;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hadoop.fs.StorageType.DEFAULT;
+import static org.apache.hadoop.fs.StorageType.RAM_DISK;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BALANCER_MAX_SIZE_TO_MOVE_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BLOCK_SIZE_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_BLOCK_PINNING_ENABLED;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_LAZY_WRITER_INTERVAL_SEC;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_MAX_LOCKED_MEMORY_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY;
+import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_NAMENODE_LAZY_PERSIST_FILE_SCRUB_INTERVAL_SEC;
+import static org.apache.hadoop.test.PlatformAssumptions.assumeNotWindows;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * Some long running Balancer tasks.
+ */
+public class TestBalancer2 {
+
+  private static final Logger LOG =
+      LoggerFactory.getLogger(TestBalancer2.class);
+
+  static {
+    GenericTestUtils.setLogLevel(Balancer.LOG, Level.TRACE);
+    GenericTestUtils.setLogLevel(Dispatcher.LOG, Level.DEBUG);
+  }
+
+  private final static long CAPACITY = 5000L;
+  private final static String RACK0 = "/rack0";
+
[jira] [Work logged] (HDFS-15810) RBF: RBFMetrics's TotalCapacity out of bounds
[ https://issues.apache.org/jira/browse/HDFS-15810?focusedWorklogId=584902=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-584902 ]

ASF GitHub Bot logged work on HDFS-15810:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Apr/21 06:10
            Start Date: 19/Apr/21 06:10
    Worklog Time Spent: 10m

Work Description: aajisaka commented on pull request #2910:
URL: https://github.com/apache/hadoop/pull/2910#issuecomment-822196865

Would you update the web UI to use the new metrics?
https://github.com/aajisaka/hadoop/blob/486ddb73f693177787e4abff7c932be9b925/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/webapps/router/federationhealth.html#L116-L118

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 584902)
    Time Spent: 1h 20m  (was: 1h 10m)

> RBF: RBFMetrics's TotalCapacity out of bounds
> ---------------------------------------------
>
>                 Key: HDFS-15810
>                 URL: https://issues.apache.org/jira/browse/HDFS-15810
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Xiaoxing Wei
>            Assignee: Fengnan Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-02-02-10-59-17-113.png
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The Long type fields TotalCapacity, UsedCapacity and RemainingCapacity in
> RBFMetrics may be out of bounds.
> !image-2021-02-02-10-59-17-113.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
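The bug reported above is 64-bit overflow: capacity figures aggregated across subclusters into Java `long` fields can wrap past `Long.MAX_VALUE` and turn negative. A minimal standalone sketch (the values are illustrative and this is not RBFMetrics code) showing the wrap and one overflow-safe alternative using `BigInteger`:

```java
import java.math.BigInteger;

public class CapacityOverflowDemo {
  public static void main(String[] args) {
    // Two subclusters, each reporting capacity close to Long.MAX_VALUE bytes.
    long capacityA = Long.MAX_VALUE - 10;
    long capacityB = 100;

    // Plain long addition silently wraps past Long.MAX_VALUE to a negative value.
    long wrapped = capacityA + capacityB;
    System.out.println(wrapped < 0); // prints "true": the sum overflowed

    // Aggregating with BigInteger preserves the true total.
    BigInteger total =
        BigInteger.valueOf(capacityA).add(BigInteger.valueOf(capacityB));
    System.out.println(total.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0); // prints "true"
  }
}
```

Any consumer (such as a web UI reading the metric) then needs to accept the wider type, which is why the reviewer asks about updating federationhealth.html above.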
[jira] [Work logged] (HDFS-15989) Split TestBalancer into two classes
[ https://issues.apache.org/jira/browse/HDFS-15989?focusedWorklogId=584898=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-584898 ]

ASF GitHub Bot logged work on HDFS-15989:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Apr/21 05:59
            Start Date: 19/Apr/21 05:59
    Worklog Time Spent: 10m

Work Description: virajjasani commented on pull request #2923:
URL: https://github.com/apache/hadoop/pull/2923#issuecomment-822191601

@aajisaka would you like to take a look? Thanks

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 584898)
    Time Spent: 2.5h  (was: 2h 20m)

> Split TestBalancer into two classes
> -----------------------------------
>
>                 Key: HDFS-15989
>                 URL: https://issues.apache.org/jira/browse/HDFS-15989
>             Project: Hadoop HDFS
>          Issue Type: Task
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> TestBalancer has accumulated many tests; it would be good to split it into
> two classes. Moreover, TestBalancer#testMaxIterationTime is flaky. We should
> also resolve it with this Jira.
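The JIRA description above notes that TestBalancer#testMaxIterationTime is flaky. Strict wall-clock assertions commonly are on shared CI hosts, where scheduling jitter can push a run slightly past its limit. A small, hypothetical sketch (illustrative only, not the actual TestBalancer code) contrasting a brittle elapsed-time check with one that tolerates a bounded amount of slack:

```java
public class ElapsedTimeCheck {
  // Brittle: fails whenever a loaded CI host stalls slightly past the limit.
  static boolean strictCheck(long elapsedMs, long limitMs) {
    return elapsedMs <= limitMs;
  }

  // More robust: allow a bounded slack factor for scheduling jitter.
  static boolean tolerantCheck(long elapsedMs, long limitMs, double slack) {
    return elapsedMs <= (long) (limitMs * slack);
  }

  public static void main(String[] args) {
    // An iteration that took 20.1s against a 20s limit.
    System.out.println(strictCheck(20_100, 20_000));        // prints "false"
    System.out.println(tolerantCheck(20_100, 20_000, 1.5)); // prints "true"
  }
}
```

The tolerance factor is a judgment call; too generous and the test stops catching real regressions, too tight and the flakiness returns.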