[jira] [Commented] (HDDS-2446) ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
[ https://issues.apache.org/jira/browse/HDDS-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972177#comment-16972177 ] Stephen O'Donnell commented on HDDS-2446: -

I think there could be an argument for merging DatanodeDetails and DatanodeInfo into a single object, but that would likely be a very large change and I'm not sure it's the best idea either.

{quote}
Just a thought, what if we get the state of all available datanodes at the start of the ReplicationManager cycle? We can avoid multiple lookups for the same datanode.
{quote}

I had considered this, but it doesn't really gain us anything, because:

1. We would need to store the state in a HashMap or similar structure, so we would still pay the price of a lookup per container.
2. The cached data could change part way through a run.

In order to make decisions about how to handle any ContainerReplica, we are going to need to know the NodeStatus (health and OpState) going forward, and I think it's cleaner and more efficient if we reference the DatanodeInfo directly within it. The alternative is to pass the NodeManager object into anything that needs to deal with the replicas and do a lookup per container via the NodeManager. That would not be terrible, but both DatanodeDetails and DatanodeInfo are tied very closely to registration in SCM, so we should be able to control how a DatanodeInfo gets created.

> ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
>
> Key: HDDS-2446
> URL: https://issues.apache.org/jira/browse/HDDS-2446
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Components: SCM
> Affects Versions: 0.5.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The ContainerReplica object is used by the SCM to track containers reported
> by the datanodes.
> The current fields stored in ContainerReplica are:
> {code}
> final private ContainerID containerID;
> final private ContainerReplicaProto.State state;
> final private DatanodeDetails datanodeDetails;
> final private UUID placeOfBirth;
> {code}
> Now that we have introduced decommission and maintenance mode, the
> replication manager (and potentially other parts of the code) needs to know
> the status of the replica in terms of IN_SERVICE, DECOMMISSIONING,
> DECOMMISSIONED etc to make replication decisions.
> The DatanodeDetails object does not carry this information; however, the
> DatanodeInfo object extends DatanodeDetails and does carry the required
> information.
> As DatanodeInfo extends DatanodeDetails, any place which needs a
> DatanodeDetails can accept a DatanodeInfo instead.
> In this Jira I propose we change the DatanodeDetails stored in
> ContainerReplica to DatanodeInfo.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
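The substitution argument above can be sketched as follows. This is a minimal stand-in for the real Ozone classes (field and method names here are illustrative, not the actual Ozone API): because DatanodeInfo IS-A DatanodeDetails, a ContainerReplica holding a DatanodeInfo still satisfies every caller that only needs a DatanodeDetails, while replication code gains direct access to the node state.

```java
// Minimal stand-ins for the real Ozone classes (shapes assumed for illustration).
class DatanodeDetails {
    final String uuid;
    DatanodeDetails(String uuid) { this.uuid = uuid; }
}

// DatanodeInfo extends DatanodeDetails and additionally carries node state.
class DatanodeInfo extends DatanodeDetails {
    final String opState; // e.g. IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED
    DatanodeInfo(String uuid, String opState) {
        super(uuid);
        this.opState = opState;
    }
    String getOpState() { return opState; }
}

// ContainerReplica holding a DatanodeInfo instead of a DatanodeDetails.
class ContainerReplica {
    private final DatanodeInfo datanodeInfo;
    ContainerReplica(DatanodeInfo info) { this.datanodeInfo = info; }

    // Existing callers that only need a DatanodeDetails keep working,
    // since DatanodeInfo IS-A DatanodeDetails.
    DatanodeDetails getDatanodeDetails() { return datanodeInfo; }

    // New capability: replication decisions can read the op state directly,
    // with no NodeManager lookup per container.
    String getOpState() { return datanodeInfo.getOpState(); }
}

class ReplicaDemo {
    public static void main(String[] args) {
        ContainerReplica replica =
            new ContainerReplica(new DatanodeInfo("uuid-1", "DECOMMISSIONING"));
        System.out.println(replica.getOpState()); // DECOMMISSIONING
    }
}
```

The trade-off discussed in the comment falls out of this shape: the per-container state lives on the object itself, rather than behind a per-container NodeManager lookup.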
[jira] [Commented] (HDFS-14955) RBF: getQuotaUsage() on mount point should return global quota.
[ https://issues.apache.org/jira/browse/HDFS-14955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972193#comment-16972193 ] Hadoop QA commented on HDFS-14955: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 46s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 2s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 31s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 5s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 23s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 63m 45s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14955 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985594/HDFS-14955.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 73de4827f87d 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / b988487 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28292/testReport/ | | Max. process+thread count | 2763 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28292/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > RBF: getQuotaUsage() on mount point should return global quota. > --- > > Key:
[jira] [Commented] (HDFS-13811) RBF: Race condition between router admin quota update and periodic quota update service
[ https://issues.apache.org/jira/browse/HDFS-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972201#comment-16972201 ] Hadoop QA commented on HDFS-13811: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 52s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 39s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 15s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 40s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 5s{color} | {color:red} hadoop-hdfs-rbf in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 65m 45s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouterFaultTolerant | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-13811 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985596/HDFS-13811.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 648dea06f22d 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / b988487 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28293/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28293/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28293/testReport/ | | Max. process+thread count
[jira] [Created] (HDFS-14980) diskbalancer query command always tries to contact to port 9867
Nilotpal Nandi created HDFS-14980: -
Summary: diskbalancer query command always tries to contact to port 9867
Key: HDFS-14980
URL: https://issues.apache.org/jira/browse/HDFS-14980
Project: Hadoop HDFS
Issue Type: Bug
Components: diskbalancer
Reporter: Nilotpal Nandi

The diskbalancer query command always tries to connect to port 9867, even when the datanode IPC port is different. In this setup, the datanode IPC port is set to 20001. The diskbalancer report command works fine and connects to IPC port 20001:

{noformat}
hdfs diskbalancer -report -node 172.27.131.193
19/11/12 08:58:55 INFO command.Command: Processing report command
19/11/12 08:58:57 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
19/11/12 08:58:57 INFO block.BlockTokenSecretManager: Setting block keys
19/11/12 08:58:57 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
19/11/12 08:58:58 INFO command.Command: Reporting volume information for DataNode(s). These DataNode(s) are parsed from '172.27.131.193'.
Processing report command
Reporting volume information for DataNode(s). These DataNode(s) are parsed from '172.27.131.193'.
[172.27.131.193:20001] - : 3 volumes with node data density 0.05.
[DISK: volume-/dataroot/ycloud/dfs/NEW_DISK1/] - 0.15 used: 39343871181/259692498944, 0.85 free: 220348627763/259692498944, isFailed: False, isReadOnly: False, isSkip: False, isTransient: False.
[DISK: volume-/dataroot/ycloud/dfs/NEW_DISK2/] - 0.15 used: 39371179986/259692498944, 0.85 free: 220321318958/259692498944, isFailed: False, isReadOnly: False, isSkip: False, isTransient: False.
[DISK: volume-/dataroot/ycloud/dfs/dn/] - 0.19 used: 49934903670/259692498944, 0.81 free: 209757595274/259692498944, isFailed: False, isReadOnly: False, isSkip: False, isTransient: False.
{noformat}

But the diskbalancer query command fails and tries to connect to port 9867 (the default port).
{noformat}
hdfs diskbalancer -query 172.27.131.193
19/11/12 06:37:15 INFO command.Command: Executing "query plan" command.
19/11/12 06:37:16 INFO ipc.Client: Retrying connect to server: /172.27.131.193:9867. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/11/12 06:37:17 INFO ipc.Client: Retrying connect to server: /172.27.131.193:9867. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
..
..
..
19/11/12 06:37:25 ERROR tools.DiskBalancerCLI: Exception thrown while running DiskBalancerCLI.
{noformat}

Expectation: the diskbalancer query command should work without the datanode IPC port having to be specified explicitly.
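The reported behaviour can be modelled in isolation. The helper below is hypothetical (it is not the actual DiskBalancerCLI code): it assumes that when the node argument carries no explicit port, a hard-coded default of 9867 is used instead of the configured datanode IPC port, which is consistent with the logs above.

```java
// Hypothetical model of how a CLI might resolve a datanode address argument.
// Mirrors the reported symptom: with no explicit port, a hard-coded default
// (9867) is used rather than the configured IPC port (20001 in this setup).
class DatanodeAddressSketch {
    static final int DEFAULT_IPC_PORT = 9867;

    // Returns "host:port", appending the default port when none is given.
    static String resolve(String nodeArg) {
        if (nodeArg.contains(":")) {
            return nodeArg; // caller supplied an explicit port
        }
        return nodeArg + ":" + DEFAULT_IPC_PORT;
    }

    public static void main(String[] args) {
        System.out.println(resolve("172.27.131.193"));       // 172.27.131.193:9867
        System.out.println(resolve("172.27.131.193:20001")); // 172.27.131.193:20001
    }
}
```

If this model matches the CLI's behaviour, passing the address as host:port (e.g. `hdfs diskbalancer -query 172.27.131.193:20001`) may serve as a workaround until the command is fixed to pick up the configured IPC port.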
[jira] [Commented] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index
[ https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972250#comment-16972250 ] Feng Yuan commented on HDFS-14617: --

In loadINodeSectionInParallel:

{code:java}
new Runnable() {
  @Override
  public void run() {
    ...
    prog.setCount(Phase.LOADING_FSIMAGE, currentStep, totalLoaded.get());
    ...
  }
}
{code}

Why is this setCount call not made outside of the sub-thread function?

> Improve fsimage load time by writing sub-sections to the fsimage index
> --
>
> Key: HDFS-14617
> URL: https://issues.apache.org/jira/browse/HDFS-14617
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: HDFS-14617.001.patch, ParallelLoading.svg,
> SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg,
> flamegraph.serial.svg, inodes.svg
>
> Loading an fsimage is basically a single threaded process. The current
> fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots,
> Snapshot_Diff etc. Then at the end of the file, an index is written that
> contains the offset and length of each section. The image loader code uses
> this index to initialize an input stream to read and process each section. It
> is important that one section is fully loaded before another is started, as
> the next section depends on the results of the previous one.
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the
> index.
> That way, a given section would effectively be split into several
> sections, eg:
> {code:java}
> inode_section offset 10 length 1000
>   inode_sub_section offset 10 length 500
>   inode_sub_section offset 510 length 500
>
> inode_dir_section offset 1010 length 1000
>   inode_dir_sub_section offset 1010 length 500
>   inode_dir_sub_section offset 1510 length 500
> {code}
> Here you can see we still have the original section index, but then we also
> have sub-section entries that cover the entire section. Then a processor can
> either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections,
> and then based on the total inodes in memory, it will create that many
> sub-sections per major image section. I think the only sections worth doing
> this for are inode, inode_reference, inode_dir and snapshot_diff. All others
> tend to be fairly small in practice.
> 3. If there are under some threshold of inodes (eg 10M) then don't bother
> with the sub-sections, as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading'
> and a 'number of threads' where it uses the sub-sections, or if not enabled
> falls back to the existing logic to read the entire section in serial.
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of
> concept of this change working, allowing just inode and inode_dir to be
> loaded in parallel, but I believe inode_reference and snapshot_diff can be
> made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads     1    2    3    4
> ----------------------------
> inodes    448  290  226  189
> inode_dir 326  211  170  161
> Total     927  651  535  488   (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the
> inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve the
> load time of the two sections. With the patch in HDFS-13694 it would take a
> further 100 seconds off the run time, going from 927 seconds to 388, which is
> a significant improvement. Adding more threads beyond 4 has diminishing
> returns, as there are some synchronized points in the loading code to protect
> the in-memory structures.
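The loading scheme proposed above can be sketched as a toy model (this is not the HDFS code: here a "sub-section" is just an offset/length pair and "loading" merely counts bytes, but the structure, splitting a section into sub-sections and processing them on a fixed thread pool while still completing each section before the next starts, follows the proposal):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

class SubSectionLoaderSketch {
    // A sub-section, analogous to an (offset, length) entry in the image index.
    static class SubSection {
        final long offset, length;
        SubSection(long offset, long length) { this.offset = offset; this.length = length; }
    }

    // Split a section into roughly equal sub-sections, as the image writer would.
    static List<SubSection> split(long offset, long length, int parts) {
        List<SubSection> subs = new ArrayList<>();
        long chunk = length / parts;
        for (int i = 0; i < parts; i++) {
            long start = offset + i * chunk;
            long len = (i == parts - 1) ? length - i * chunk : chunk;
            subs.add(new SubSection(start, len));
        }
        return subs;
    }

    // "Load" every sub-section in parallel; here loading just counts bytes.
    static long loadInParallel(List<SubSection> subs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong loaded = new AtomicLong();
        List<Future<?>> futures = new ArrayList<>();
        for (SubSection s : subs) {
            futures.add(pool.submit(() -> loaded.addAndGet(s.length)));
        }
        for (Future<?> f : futures) {
            f.get(); // the next section must not start before this one completes
        }
        pool.shutdown();
        return loaded.get();
    }

    public static void main(String[] args) throws Exception {
        // Mirrors the inode_section example: offset 10, length 1000, 4 sub-sections.
        List<SubSection> subs = split(10, 1000, 4);
        System.out.println(loadInParallel(subs, 4)); // 1000
    }
}
```

The join on the futures is what preserves the "one section fully loaded before the next starts" invariant while still parallelising within a section.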
[jira] [Created] (HDDS-2460) Default checksum type is wrong in description
Attila Doroszlai created HDDS-2460: --
Summary: Default checksum type is wrong in description
Key: HDDS-2460
URL: https://issues.apache.org/jira/browse/HDDS-2460
Project: Hadoop Distributed Data Store
Issue Type: Bug
Components: Ozone Client
Reporter: Attila Doroszlai

Default client checksum type is CRC32, but the config item's description says it's SHA256 (leftover from HDDS-1149). The description should be updated to match the actual default value.

{code:title=https://github.com/apache/hadoop-ozone/blob/a6f80c096b5320f50b6e9e9b4ba5f7c7e3544385/hadoop-hdds/common/src/main/resources/ozone-default.xml#L1489-L1497}
<property>
  <name>ozone.client.checksum.type</name>
  <value>CRC32</value>
  <tag>OZONE, CLIENT, MANAGEMENT</tag>
  <description>
    The checksum type [NONE/ CRC32/ CRC32C/ SHA256/ MD5] determines which
    algorithm would be used to compute checksum for chunk data.
    Default checksum type is SHA256.
  </description>
</property>
{code}
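The implied fix is a one-word change to the last sentence of the description so it names the actual default. A sketch of the corrected snippet (not taken from a committed patch):

```xml
<description>
  The checksum type [NONE/ CRC32/ CRC32C/ SHA256/ MD5] determines which
  algorithm would be used to compute checksum for chunk data.
  Default checksum type is CRC32.
</description>
```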
[jira] [Commented] (HDDS-2446) ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
[ https://issues.apache.org/jira/browse/HDDS-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972318#comment-16972318 ] Stephen O'Donnell commented on HDDS-2446: -

I looked into the code a bit more to double-check this area. The only place outside of tests where a DatanodeInfo object gets created is via SCMNodeManager.register() -> nodeStateManager.addNode(), which creates the new DatanodeInfo.

So far as I can tell, nothing cleans a registered node (DatanodeDetails or DatanodeInfo) out of SCM except a restart - it will remember all nodes which have previously registered with it. If a node re-registers, the above chain of calls will give a NodeAlreadyExists exception on registration, which is caught, and a success is still returned to the DN. If a node goes dead, then all its containers will be purged, but if it re-registers without being dead, the containers will still be present, referencing the old DatanodeInfo object, which will not have changed.

One thing we could do is purge the container list on re-registration, as the register command should have a container report which must be processed anyway.

As an aside, I wonder if there is a bug in the re-registration process - the way SCM checks if a node has already registered is to look it up by UUID. If a DN is stopped and changes its IP or hostname, but retains the UUID, then it will 'register' successfully but the DatanodeDetails information will not be updated if any of it has changed.
{code}
public RegisteredCommand register(
    DatanodeDetails datanodeDetails, NodeReportProto nodeReport,
    PipelineReportsProto pipelineReportsProto) {
  InetAddress dnAddress = Server.getRemoteIp();
  if (dnAddress != null) {
    // Mostly called inside an RPC, update ip and peer hostname
    datanodeDetails.setHostName(dnAddress.getHostName());
    datanodeDetails.setIpAddress(dnAddress.getHostAddress());
  }
  try {
    String dnsName;
    String networkLocation;
    datanodeDetails.setNetworkName(datanodeDetails.getUuidString());
    if (useHostname) {
      dnsName = datanodeDetails.getHostName();
    } else {
      dnsName = datanodeDetails.getIpAddress();
    }
    networkLocation = nodeResolve(dnsName);
    if (networkLocation != null) {
      datanodeDetails.setNetworkLocation(networkLocation);
    }
    nodeStateManager.addNode(datanodeDetails);
    clusterMap.add(datanodeDetails);
    addEntryTodnsToUuidMap(dnsName, datanodeDetails.getUuidString());
    // Updating Node Report, as registration is successful
    processNodeReport(datanodeDetails, nodeReport);
    LOG.info("Registered Data node : {}", datanodeDetails);
  } catch (NodeAlreadyExistsException e) {
    if (LOG.isTraceEnabled()) {
      LOG.trace("Datanode is already registered. Datanode: {}",
          datanodeDetails.toString());
    }
  }
  return RegisteredCommand.newBuilder().setErrorCode(ErrorCode.success)
      .setDatanode(datanodeDetails)
      .setClusterID(this.scmStorageConfig.getClusterID())
      .build();
}
{code}

We should probably open another Jira to confirm whether this bug is really there, but we may need to look at re-registration for maintenance mode anyway, as that will involve a node going dead, NOT clearing its replicas out, and then registering again.
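The suspected re-registration bug can be modelled in isolation. The toy registry below (illustrative names only; it stands in for the UUID-keyed node map behind nodeStateManager) shows the pattern being described: a second register call with a changed hostname but the same UUID is treated as "already exists", so the stored details are never refreshed.

```java
import java.util.HashMap;
import java.util.Map;

class ReRegistrationSketch {
    static class NodeDetails {
        final String uuid;
        String hostName;
        NodeDetails(String uuid, String hostName) {
            this.uuid = uuid;
            this.hostName = hostName;
        }
    }

    // Registry keyed by UUID, standing in for the node map in SCM.
    static final Map<String, NodeDetails> nodes = new HashMap<>();

    // Mirrors the described behaviour: an existing UUID short-circuits
    // registration ("NodeAlreadyExists" is caught and success is still
    // returned), so updated host/IP details are silently dropped.
    static boolean register(NodeDetails dn) {
        if (nodes.containsKey(dn.uuid)) {
            return false; // already registered; stored details untouched
        }
        nodes.put(dn.uuid, dn);
        return true;
    }

    public static void main(String[] args) {
        register(new NodeDetails("uuid-1", "host-a"));
        // The DN restarts with a new hostname but the same UUID...
        register(new NodeDetails("uuid-1", "host-b"));
        // ...and the registry still holds the stale hostname.
        System.out.println(nodes.get("uuid-1").hostName); // host-a
    }
}
```

If the real register() follows this shape, the fix would presumably be to update the stored details (and reprocess the node report) on the already-exists path rather than ignoring it.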
[jira] [Commented] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index
[ https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972323#comment-16972323 ] Stephen O'Donnell commented on HDFS-14617: --

[~Feng Yuan] The progress object (prog) is updated by each thread as it completes loading the inodes for the sub-section, which means anything using or monitoring the prog object can see the progress being made across all loading threads. I think this is used by the webUI to report startup progress. Therefore I think the call must be made in the sub-thread.
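The explanation above can be illustrated with a toy progress object (illustrative names; the real StartupProgress machinery is more involved): because each worker updates the shared counter as it loads, any observer, such as the web UI, sees partial progress while loading is still running. If the count were only set after the threads joined, the observer would see nothing until the very end.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class ProgressSketch {
    // Shared counter, standing in for prog.setCount(...) on the progress object.
    static final AtomicLong loaded = new AtomicLong();

    static void loadInParallel(int threads, int inodesPerThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < inodesPerThread; i++) {
                    // Updating inside the worker makes partial progress visible
                    // to any concurrent observer (e.g. a web UI poller).
                    loaded.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        loadInParallel(4, 1000);
        System.out.println(loaded.get()); // 4000
    }
}
```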
[jira] [Comment Edited] (HDDS-2446) ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
[ https://issues.apache.org/jira/browse/HDDS-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972318#comment-16972318 ] Stephen O'Donnell edited comment on HDDS-2446 at 11/12/19 12:16 PM:

I looked into the code a bit more to double-check this area. The only place outside of tests where a DatanodeInfo object gets created is via SCMNodeManager.register() -> nodeStateManager.addNode(), which creates the new DatanodeInfo.

So far as I can tell, nothing cleans a registered node (DatanodeDetails or DatanodeInfo) out of SCM except a restart - it will remember all nodes which have previously registered with it. If a node re-registers, the above chain of calls will give a NodeAlreadyExists exception on registration, which is caught, and a success is still returned to the DN. If a node goes dead, then all its containers will be purged, but if it re-registers without being dead, the containers will still be present, referencing the old DatanodeInfo object, which will not have changed.

One thing we could do is purge the container list on re-registration, as the register command should have a container report which must be processed anyway.

As an aside, I wonder if there is a bug in the re-registration process - the way SCM checks if a node has already registered is to look it up by UUID. If a DN is stopped and changes its IP or hostname, but retains the UUID, then it will 'register' successfully but the DatanodeDetails information will not be updated if any of it has changed.
{code}
public RegisteredCommand register(
    DatanodeDetails datanodeDetails, NodeReportProto nodeReport,
    PipelineReportsProto pipelineReportsProto) {
  InetAddress dnAddress = Server.getRemoteIp();
  if (dnAddress != null) {
    // Mostly called inside an RPC, update ip and peer hostname
    datanodeDetails.setHostName(dnAddress.getHostName());
    datanodeDetails.setIpAddress(dnAddress.getHostAddress());
  }
  try {
    String dnsName;
    String networkLocation;
    datanodeDetails.setNetworkName(datanodeDetails.getUuidString());
    if (useHostname) {
      dnsName = datanodeDetails.getHostName();
    } else {
      dnsName = datanodeDetails.getIpAddress();
    }
    networkLocation = nodeResolve(dnsName);
    if (networkLocation != null) {
      datanodeDetails.setNetworkLocation(networkLocation);
    }
    // <<- This will throw NodeAlreadyExists on re-registration, which
    //     means the nodeReport below is also not processed.
    nodeStateManager.addNode(datanodeDetails);
    clusterMap.add(datanodeDetails);
    addEntryTodnsToUuidMap(dnsName, datanodeDetails.getUuidString());
    // Updating Node Report, as registration is successful
    processNodeReport(datanodeDetails, nodeReport);
    LOG.info("Registered Data node : {}", datanodeDetails);
  } catch (NodeAlreadyExistsException e) {
    if (LOG.isTraceEnabled()) {
      LOG.trace("Datanode is already registered. Datanode: {}",
          datanodeDetails.toString());
    }
  }
  return RegisteredCommand.newBuilder().setErrorCode(ErrorCode.success)
      .setDatanode(datanodeDetails)
      .setClusterID(this.scmStorageConfig.getClusterID())
      .build();
}
{code}
We should probably open another Jira if this bug is potentially there, but we may need to look at re-registration for maintenance mode anyway, as that will involve a node going dead, NOT clearing its replicas out, and then registering again.
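The suspected re-registration gap can be illustrated with a toy registry keyed by UUID. This is a plain-Java sketch under stated assumptions - `NodeRegistry` and its fields are hypothetical stand-ins, not the real SCMNodeManager API - showing how an existing-UUID path that returns success without touching the stored details would leave a changed hostname stale:

```java
import java.util.HashMap;
import java.util.Map;

class NodeRegistry {
  // Hypothetical stand-in for SCM's node table, keyed by datanode UUID.
  private final Map<String, String> hostnameByUuid = new HashMap<>();

  /** Mirrors the suspected behaviour: an existing UUID short-circuits the
      update, so a changed hostname is silently ignored and success is
      still reported to the datanode. */
  public boolean register(String uuid, String hostname) {
    if (hostnameByUuid.containsKey(uuid)) {
      return true; // "NodeAlreadyExists" swallowed; details not refreshed
    }
    hostnameByUuid.put(uuid, hostname);
    return true;
  }

  public String hostnameOf(String uuid) {
    return hostnameByUuid.get(uuid);
  }

  public static void main(String[] args) {
    NodeRegistry r = new NodeRegistry();
    r.register("uuid-1", "host-a");
    r.register("uuid-1", "host-b"); // DN restarted with a new hostname
    // The registry still reports the stale hostname:
    System.out.println(r.hostnameOf("uuid-1")); // prints "host-a"
  }
}
```

A fix along the lines discussed above would update the stored details (and process the accompanying container report) on the already-registered path instead of returning early.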
[jira] [Commented] (HDFS-14882) Consider DataNode load when #getBlockLocation
[ https://issues.apache.org/jira/browse/HDFS-14882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972334#comment-16972334 ] Istvan Fajth commented on HDFS-14882: - Hi [~hexiaoqiao], patch-10 looks good to me, sorry for the slow response, I got a bit overwhelmed with other stuff and couldn't get back here faster. > Consider DataNode load when #getBlockLocation > - > > Key: HDFS-14882 > URL: https://issues.apache.org/jira/browse/HDFS-14882 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-14882.001.patch, HDFS-14882.002.patch, > HDFS-14882.003.patch, HDFS-14882.004.patch, HDFS-14882.005.patch, > HDFS-14882.006.patch, HDFS-14882.007.patch, HDFS-14882.008.patch, > HDFS-14882.009.patch, HDFS-14882.010.patch, HDFS-14882.suggestion > > > Currently, we consider the load of a datanode in #chooseTarget for writers, > but not for readers. Thus, the transfer slots of a datanode can be occupied > by #BlockSender instances serving readers, the disk/network can become heavily > loaded, and we then hit slow-node exceptions. IIRC the same case has been > reported several times. Based on that, I propose to consider load for readers > the same way #chooseTarget does for writers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
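The proposal - prefer less-loaded datanodes when returning read locations - can be sketched as a minimal selection over replica load. This is an illustrative plain-Java model, not the actual HDFS-14882 patch: `Replica` and `activeTransfers` are hypothetical stand-ins for `DatanodeInfo` and its transceiver count.

```java
import java.util.Comparator;
import java.util.List;

class ReadReplicaChooser {
  // Hypothetical replica descriptor; the real code would consult
  // DatanodeInfo load metrics maintained by the NameNode.
  record Replica(String host, int activeTransfers) {}

  /** Pick the least-loaded replica instead of the first one returned,
      mirroring what #chooseTarget already does for writers. */
  static Replica leastLoaded(List<Replica> replicas) {
    return replicas.stream()
        .min(Comparator.comparingInt(Replica::activeTransfers))
        .orElseThrow();
  }

  public static void main(String[] args) {
    List<Replica> locs = List.of(
        new Replica("dn1", 40), new Replica("dn2", 3), new Replica("dn3", 17));
    System.out.println(leastLoaded(locs).host()); // dn2
  }
}
```

In practice the ordering would likely be a sort of all returned locations rather than a single pick, so the client still has fallbacks if the first choice is unreachable.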
[jira] [Commented] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index
[ https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972347#comment-16972347 ] Feng Yuan commented on HDFS-14617: -- [~sodonnell] OK, I understand it now. I have another question: why is parallel loading disabled when compression is enabled? {code:java} "Parallel Image loading is not supported when {} is set to" + " true. Parallel loading will be disabled." {code} Thanks for your replies. > Improve fsimage load time by writing sub-sections to the fsimage index > -- > > Key: HDFS-14617 > URL: https://issues.apache.org/jira/browse/HDFS-14617 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Fix For: 2.10.0, 3.3.0 > > Attachments: HDFS-14617.001.patch, ParallelLoading.svg, > SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg, > flamegraph.serial.svg, inodes.svg > > > Loading an fsimage is basically a single threaded process. The current > fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, > Snapshot_Diff etc. Then at the end of the file, an index is written that > contains the offset and length of each section. The image loader code uses > this index to initialize an input stream to read and process each section. It > is important that one section is fully loaded before another is started, as > the next section depends on the results of the previous one. > What I would like to propose is the following: > 1. When writing the image, we can optionally output sub_sections to the > index.
That way, a given section would effectively be split into several > sub-sections, eg:
> {code:java}
> inode_section offset 10 length 1000
>   inode_sub_section offset 10 length 500
>   inode_sub_section offset 510 length 500
>
> inode_dir_section offset 1010 length 1000
>   inode_dir_sub_section offset 1010 length 500
>   inode_dir_sub_section offset 1510 length 500
> {code}
> Here you can see we still have the original section index, but then we also > have sub-section entries that cover the entire section. Then a processor can > either read the full section in serial, or read each sub-section in parallel. > 2. In the Image Writer code, we should set a target number of sub-sections, > and then based on the total inodes in memory, it will create that many > sub-sections per major image section. I think the only sections worth doing > this for are inode, inode_reference, inode_dir and snapshot_diff. All others > tend to be fairly small in practice. > 3. If there are under some threshold of inodes (eg 10M) then don't bother > with the sub-sections, as a serial load only takes a few seconds at that scale. > 4. The image loading code can then have a switch to enable 'parallel loading' > and a 'number of threads', where it uses the sub-sections, or if not enabled > falls back to the existing logic to read the entire section in serial. > Working with a large image of 316M inodes and 35GB on disk, I have a proof of > concept of this change working, allowing just inode and inode_dir to be > loaded in parallel, but I believe inode_reference and snapshot_diff can be > made parallel with the same technique. > Some benchmarks I have are as follows:
> {code:java}
> Threads      1    2    3    4
> inodes     448  290  226  189
> inode_dir  326  211  170  161
> Total      927  651  535  488  (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the > inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve the > load time of the two sections. With the patch in HDFS-13694 it would take a > further 100 seconds off the run time, going from 927 seconds to 388, which is > a significant improvement. Adding more threads beyond 4 has diminishing > returns, as there are some synchronized points in the loading code that protect > the in-memory structures.
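The loader-side idea - read each sub-section on its own thread, but finish one major section completely before starting the next - can be sketched with plain offsets and an ExecutorService. This is illustrative only; the real loader parses the fsimage protobuf streams, and `SubSection` here is a hypothetical stand-in for an index entry:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSectionLoader {
  // Hypothetical sub-section index entry: (offset, length) within the image.
  record SubSection(long offset, long length) {}

  /** Load every sub-section of one major section in parallel, returning the
      total bytes "processed". Returning only after every future completes
      gives the barrier: the next major section must wait for this one. */
  static long loadSection(List<SubSection> subs, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<Long>> tasks = subs.stream()
          .map(s -> (Callable<Long>) () -> s.length()) // stand-in for real parsing
          .toList();
      long total = 0;
      for (Future<Long> f : pool.invokeAll(tasks)) {
        total += f.get();
      }
      return total;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<SubSection> subs = List.of(
        new SubSection(10, 500), new SubSection(510, 500));
    System.out.println(loadSection(subs, 2)); // 1000
  }
}
```

This shape also hints at the compression question above: each worker must be able to seek independently to its sub-section offset, which a single non-splittable compressed stream does not allow.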
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972358#comment-16972358 ] Lisheng Sun commented on HDFS-14648: {quote} 2) The line newDeadNodes.retainAll(deadNodes.values()); does not look correct; it will make newDeadNodes the same as the old deadNodes.
{code:java}
+  public synchronized Set<DatanodeInfo> getDeadNodesToDetect() {
+    // remove the dead nodes which don't have any inputstream first
+    Set<DatanodeInfo> newDeadNodes = new HashSet<>();
+    for (HashSet<DatanodeInfo> datanodeInfos : dfsInputStreamNodes.values()) {
+      newDeadNodes.addAll(datanodeInfos);
+    }
+
+    newDeadNodes.retainAll(deadNodes.values());
+
+    for (DatanodeInfo datanodeInfo : deadNodes.values()) {
+      if (!newDeadNodes.contains(datanodeInfo)) {
+        deadNodes.remove(datanodeInfo);
+      }
+    }
+    return newDeadNodes;
+  }
{code}
{quote} Thanks [~linyiqun] for the thorough review comments. Indeed, newDeadNodes should be the same as the old deadNodes in DeadNodeDetector#clearAndGetDetectedDeadNodes. I have updated the patch and uploaded the v008 patch. Thank you a lot. [~linyiqun] > DeadNodeDetector basic model > > > Key: HDFS-14648 > URL: https://issues.apache.org/jira/browse/HDFS-14648 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, > HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, > HDFS-14648.006.patch, HDFS-14648.007.patch > > > This Jira constructs the DeadNodeDetector state machine model. The functions it > implements are as follows: > # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode > of the block is found to be inaccessible, put the DataNode into > DeadNodeDetector#deadnode. (HDFS-14649 will optimize this part.) Because when a > DataNode is not accessible, it is likely that the replica has been removed > from the DataNode. Therefore, it needs to be confirmed by re-probing and > requires higher-priority processing.
> # DeadNodeDetector will periodically probe the nodes in > DeadNodeDetector#deadnode; if the access is successful, the node will be > removed from DeadNodeDetector#deadnode. Continuous detection of the dead nodes > is necessary: a DataNode may need to rejoin the cluster after a service > restart/machine repair, and it could be permanently excluded if there were > no added probe mechanism. > # DeadNodeDetector#dfsInputStreamNodes records the DataNodes used by each > DFSInputstream. When the DFSInputstream is closed, they are removed from > DeadNodeDetector#dfsInputStreamNodes. > # Every time we get the global deadnode list, we update DeadNodeDetector#deadnode. > The new DeadNodeDetector#deadnode equals the intersection of the old > DeadNodeDetector#deadnode and the DataNodes referenced by > DeadNodeDetector#dfsInputStreamNodes. > # DeadNodeDetector has a switch that is turned off by default. When it is > off, each DFSInputstream still uses its own local deadnode list. > # This feature has been used in the XIAOMI production environment for a long > time. It reduced HBase read stalls due to node hangs. > # Just turn on the DeadNodeDetector switch and you can use it directly. There are no > other restrictions. If you don't want to use DeadNodeDetector, just turn it off.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}
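The intersection step being discussed - the new global deadnode set is the old set intersected with the nodes still referenced by open input streams - can be modelled with plain sets. Illustrative only: strings stand in for DatanodeInfo, and `retainAll` replaces the quoted remove-while-iterating loop.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DeadNodeSets {
  /** Keep only dead nodes still referenced by some open input stream:
      newDeadNodes = union(streams' nodes) ∩ deadNodes, and entries of
      deadNodes outside that intersection are dropped. */
  static Set<String> clearAndGetDetectedDeadNodes(
      Map<String, Set<String>> dfsInputStreamNodes, Set<String> deadNodes) {
    Set<String> referenced = new HashSet<>();
    for (Set<String> nodes : dfsInputStreamNodes.values()) {
      referenced.addAll(nodes);
    }
    referenced.retainAll(deadNodes);   // intersection with old dead set
    deadNodes.retainAll(referenced);   // drop dead nodes no stream uses
    return referenced;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> streams = new HashMap<>();
    streams.put("stream1", new HashSet<>(List.of("dn1", "dn2")));
    // dn3 is dead but no longer referenced by any open stream:
    Set<String> dead = new HashSet<>(List.of("dn2", "dn3"));
    System.out.println(clearAndGetDetectedDeadNodes(streams, dead)); // [dn2]
    System.out.println(dead); // [dn2]
  }
}
```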
[jira] [Updated] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14648: --- Attachment: HDFS-14648.008.patch > DeadNodeDetector basic model > > > Key: HDFS-14648 > URL: https://issues.apache.org/jira/browse/HDFS-14648 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, > HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, > HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch > > > This Jira constructs DeadNodeDetector state machine model. The function it > implements as follow: > # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode > of the block is found to inaccessible, put the DataNode into > DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when > DataNode is not accessible, it is likely that the replica has been removed > from the DataNode.Therefore, it needs to be confirmed by re-probing and > requires a higher priority processing. > # DeadNodeDetector will periodically detect the Node in > DeadNodeDetector#deadnode, If the access is successful, the Node will be > moved from DeadNodeDetector#deadnode. Continuous detection of the dead node > is necessary. The DataNode need rejoin the cluster due to a service > restart/machine repair. The DataNode may be permanently excluded if there is > no added probe mechanism. > # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using > DataNode. When the DFSInputstream is closed, it will be moved from > DeadNodeDetector#dfsInputStreamNodes. > # Every time get the global deanode, update the DeadNodeDetector#deadnode. > The new DeadNodeDetector#deadnode Equals to the intersection of the old > DeadNodeDetector#deadnode and the Datanodes are by > DeadNodeDetector#dfsInputStreamNodes. > # DeadNodeDetector has a switch that is turned off by default. 
When it is > closed, each DFSInputstream still uses its own local deadnode. > # This feature has been used in the XIAOMI production environment for a long > time. Reduced hbase read stuck, due to node hangs. > # Just open the DeadNodeDetector switch and you can use it directly. No > other restrictions. Don't want to use DeadNodeDetector, just close it. > {code:java} > if (sharedDeadNodesEnabled && deadNodeDetector == null) { > deadNodeDetector = new DeadNodeDetector(name); > deadNodeDetectorThr = new Daemon(deadNodeDetector); > deadNodeDetectorThr.start(); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravuri Sushma sree updated HDFS-14442: -- Attachment: HDFS-14442.003.patch > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, > HDFS-14442.003.patch > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. >* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. >*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. 
> fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
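The disagreement can be modelled with a proxy that simply remembers its current target: getConnectionId() reports whatever endpoint the proxy last talked to, which under an observer-read provider need not be the active NN. Toy classes only, not the real RPC layer:

```java
class ProxyTarget {
  static class HaProxy {
    private final String active;   // address of the actual active NN
    private final String current;  // endpoint of the current connection
    HaProxy(String active, String current) {
      this.active = active;
      this.current = current;
    }
    // Like RpcInvocationHandler#getConnectionId(): no "active" guarantee.
    String getConnectionId() { return current; }
    String activeAddress() { return active; }
  }

  public static void main(String[] args) {
    // ObserverReadProxyProvider-style state: reads go to an observer.
    HaProxy p = new HaProxy("nn-active:8020", "nn-observer:8020");
    // HAUtil.getAddressOfActive() effectively returns this value,
    // which here is NOT the active NN's address:
    System.out.println(p.getConnectionId()); // nn-observer:8020
  }
}
```

The fix would be for callers needing the active NN to resolve it explicitly rather than trusting the current connection ID.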
[jira] [Updated] (HDFS-14528) Failover from Active to Standby Failed
[ https://issues.apache.org/jira/browse/HDFS-14528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravuri Sushma sree updated HDFS-14528: -- Attachment: HDFS-14528.006.patch > Failover from Active to Standby Failed > > > Key: HDFS-14528 > URL: https://issues.apache.org/jira/browse/HDFS-14528 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Ravuri Sushma sree >Assignee: Ravuri Sushma sree >Priority: Major > Labels: multi-sbnn > Attachments: HDFS-14528.003.patch, HDFS-14528.004.patch, > HDFS-14528.005.patch, HDFS-14528.006.patch, HDFS-14528.2.Patch, > ZKFC_issue.patch > > > *In a cluster with more than one Standby namenode, manual failover throws an > exception in some cases.* > *When trying to execute the failover command from active to standby,* > *._/hdfs haadmin -failover nn1 nn2, the below Exception is thrown_* > Operation failed: Call From X-X-X-X/X-X-X-X to Y-Y-Y-Y: failed on > connection exception: java.net.ConnectException: Connection refused > This is encountered in the following cases: > Scenario 1: > Namenodes - NN1 (Active), NN2 (Standby), NN3 (Standby) > When trying to manually fail over from NN1 to NN2 while NN3 is down, the exception is > thrown. > Scenario 2: > Namenodes - NN1 (Active), NN2 (Standby), NN3 (Standby) > ZKFCs - ZKFC1, ZKFC2, ZKFC3 > When trying to manually fail over from NN1 to NN3 while NN3's ZKFC (ZKFC3) is > down, the exception is thrown.
[jira] [Commented] (HDFS-14978) In-place Erasure Coding Conversion
[ https://issues.apache.org/jira/browse/HDFS-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972384#comment-16972384 ] Wei-Chiu Chuang commented on HDFS-14978: bq. What is the client behavior during the CAS operation OP_SWAP_BLOCK_LIST This operation is atomic. Semantically, it is similar to truncating the file to zero length, and then append the file with erasure coded blocks. Assuming both files are not open. A getBlockLocations() call for the $src prior to swapBlockList() gets the replicated block list. Once a client has the located blocks list, it has the block tokens too and it should be able to read without problems, even though the namespace has changed. > In-place Erasure Coding Conversion > -- > > Key: HDFS-14978 > URL: https://issues.apache.org/jira/browse/HDFS-14978 > Project: Hadoop HDFS > Issue Type: New Feature > Components: erasure-coding >Affects Versions: 3.0.0 >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Major > Attachments: In-place Erasure Coding Conversion.pdf > > > HDFS Erasure Coding is a new feature added in Apache Hadoop 3.0. It uses > encoding algorithms to reduce disk space usage while retaining redundancy > necessary for data recovery. It was a huge amount of work but it is just > getting adopted after almost 2 years. > One usability problem that’s blocking users from adopting HDFS Erasure Coding > is that existing replicated files have to be copied to an EC-enabled > directory explicitly. Renaming a file/directory to an EC-enabled directory > does not automatically convert the blocks. Therefore users typically perform > the following steps to erasure-code existing files: > {noformat} > Create $tmp directory, set EC policy at it > Distcp $src to $tmp > Delete $src (rm -rf $src) > mv $tmp $src > {noformat} > There are several reasons why this is not popular: > * Complex. 
The process involves several steps: distcp data to a temporary > destination; delete source file; move destination to the source path. > * Availability: there is a short period where nothing exists at the source > path, and jobs may fail unexpectedly. > * Overhead. During the copy phase, there is a point in time where all of > source and destination files exist at the same time, exhausting disk space. > * Not snapshot-friendly. If a snapshot is taken prior to performing the > conversion, the source (replicated) files will be preserved in the cluster > too. Therefore, the conversion actually increase storage space usage. > * Not management-friendly. This approach changes file inode number, > modification time and access time. Erasure coded files are supposed to store > cold data, but this conversion makes data “hot” again. > * Bulky. It’s either all or nothing. The directory may be partially erasure > coded, but this approach simply erasure code everything again. > To ease data management, we should offer a utility tool to convert replicated > files to erasure coded files in-place. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
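The proposed in-place alternative hinges on an atomic block-list swap. A minimal sketch of the intended semantics, under stated assumptions: `FileInode` is a hypothetical stand-in, and the real operation would run under the namesystem write lock and log an OP_SWAP_BLOCK_LIST edit rather than use `synchronized`.

```java
import java.util.List;

class SwapBlockListSketch {
  // Hypothetical inode holding a block list.
  static class FileInode {
    List<String> blocks;
    FileInode(List<String> blocks) { this.blocks = blocks; }
  }

  /** Atomically exchange the block lists of src (replicated) and tmp (EC).
      Done in one step, a reader sees either the old list or the new list,
      never a mixture - semantically truncate-to-zero plus append, but
      without the window where the file is empty. */
  static synchronized void swapBlockList(FileInode src, FileInode tmp) {
    List<String> old = src.blocks;
    src.blocks = tmp.blocks;
    tmp.blocks = old;
  }

  public static void main(String[] args) {
    FileInode src = new FileInode(List.of("blk_rep_1", "blk_rep_2"));
    FileInode tmp = new FileInode(List.of("blk_ec_1"));
    swapBlockList(src, tmp);
    System.out.println(src.blocks); // [blk_ec_1]
  }
}
```

Because only the block-list reference moves, the inode number, modification time, and snapshot linkage of $src are untouched, which is exactly the management-friendliness the distcp approach lacks.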
[jira] [Commented] (HDFS-14612) SlowDiskReport won't update when SlowDisks is always empty in heartbeat
[ https://issues.apache.org/jira/browse/HDFS-14612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972486#comment-16972486 ] Haibin Huang commented on HDFS-14612: - [~weichiu], I have updated this patch, would you help review it? Thank you. > SlowDiskReport won't update when SlowDisks is always empty in heartbeat > --- > > Key: HDFS-14612 > URL: https://issues.apache.org/jira/browse/HDFS-14612 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Haibin Huang >Assignee: Haibin Huang >Priority: Major > Attachments: HDFS-14612-001.patch, HDFS-14612-002.patch, > HDFS-14612-003.patch, HDFS-14612-004.patch, HDFS-14612.patch > > > I found that SlowDiskReport won't update when slowDisks is always empty in > org.apache.hadoop.hdfs.server.blockmanagement.*handleHeartbeat*; this may > leave an outdated SlowDiskReport staying in the jmx of the namenode until the next > time slowDisks isn't empty. So I think the method > *checkAndUpdateReportIfNecessary()* should be called first when we want to > get the jmx information about SlowDiskReport; this keeps the > SlowDiskReport in jmx always valid. > > There are also some incorrect object references in > org.apache.hadoop.hdfs.server.datanode.fsdataset.*DataNodeVolumeMetrics*
> {code:java}
> // Based on writeIoRate
> public long getWriteIoSampleCount() {
>   return syncIoRate.lastStat().numSamples();
> }
>
> public double getWriteIoMean() {
>   return syncIoRate.lastStat().mean();
> }
>
> public double getWriteIoStdDev() {
>   return syncIoRate.lastStat().stddev();
> }
> {code}
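The mis-referenced metric can be modelled with two plain counters. The sketch below is a toy model, not the real DataNodeVolumeMetrics class: `Rate` stands in for the rate-tracking stats, and the fix is simply pointing the getWriteIo* accessors at the write-rate tracker instead of the sync-rate one.

```java
class VolumeMetricsSketch {
  // Minimal stand-in for a rate-tracking stat object.
  static class Rate {
    private final long samples;
    Rate(long samples) { this.samples = samples; }
    long numSamples() { return samples; }
  }

  private final Rate syncIoRate = new Rate(7);
  private final Rate writeIoRate = new Rate(42);

  // Buggy version (as quoted above): the write accessor reads the
  // sync tracker, so write metrics silently report sync-IO numbers.
  long getWriteIoSampleCountBuggy() {
    return syncIoRate.numSamples();
  }

  // Corrected version: the write accessor reads the write tracker.
  long getWriteIoSampleCount() {
    return writeIoRate.numSamples();
  }

  public static void main(String[] args) {
    VolumeMetricsSketch m = new VolumeMetricsSketch();
    System.out.println(m.getWriteIoSampleCountBuggy()); // 7  (wrong source)
    System.out.println(m.getWriteIoSampleCount());      // 42 (right source)
  }
}
```

The same one-line change would apply to getWriteIoMean() and getWriteIoStdDev().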
[jira] [Assigned] (HDDS-2461) Logging by ChunkUtils is misleading
[ https://issues.apache.org/jira/browse/HDDS-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek reassigned HDDS-2461: - Assignee: Marton Elek > Logging by ChunkUtils is misleading > --- > > Key: HDDS-2461 > URL: https://issues.apache.org/jira/browse/HDDS-2461 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > > During a k8s-based test I found a lot of log messages like: > {code:java} > 2019-11-12 14:27:13 WARN ChunkManagerImpl:209 - Duplicate write chunk > request. Chunk overwrite without explicit request. > ChunkInfo{chunkName='A9UrLxiEUN_testdata_chunk_4465025, offset=0, len=1024} > {code} > I was very surprised, as at ChunkManagerImpl:209 there was no similar line. > It turned out that it's logged by ChunkUtils, but ChunkUtils uses the logger of > ChunkManagerImpl.
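The fix pattern is for each class to log under its own name, so the class:line shown in the output matches the file that emitted the message. A minimal sketch using java.util.logging instead of the SLF4J loggers the real code uses; the nested classes are illustrative stand-ins:

```java
import java.util.logging.Logger;

class LoggerOwnership {
  static class ChunkManagerImpl {
    static final Logger LOG =
        Logger.getLogger(ChunkManagerImpl.class.getName());
  }

  static class ChunkUtils {
    // The fix: ChunkUtils declares its own logger instead of borrowing
    // ChunkManagerImpl's, so its warnings are attributed correctly.
    static final Logger LOG =
        Logger.getLogger(ChunkUtils.class.getName());
  }

  public static void main(String[] args) {
    // Each logger carries its owning class's name.
    System.out.println(ChunkUtils.LOG.getName().endsWith("ChunkUtils"));
  }
}
```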
[jira] [Created] (HDDS-2461) Logging by ChunkUtils is misleading
Marton Elek created HDDS-2461: - Summary: Logging by ChunkUtils is misleading Key: HDDS-2461 URL: https://issues.apache.org/jira/browse/HDDS-2461 Project: Hadoop Distributed Data Store Issue Type: Bug Components: Ozone Datanode Reporter: Marton Elek During a k8s based test I found a lot of log message like: {code:java} 2019-11-12 14:27:13 WARN ChunkManagerImpl:209 - Duplicate write chunk request. Chunk overwrite without explicit request. ChunkInfo{chunkName='A9UrLxiEUN_testdata_chunk_4465025, offset=0, len=1024} {code} I was very surprised as at ChunkManagerImpl:209 there was no similar lines. It turned out that it's logged by ChunkUtils but it's used the logger of ChunkManagerImpl. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2461) Logging by ChunkUtils is misleading
[ https://issues.apache.org/jira/browse/HDDS-2461?focusedWorklogId=341966&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-341966 ] ASF GitHub Bot logged work on HDDS-2461: Author: ASF GitHub Bot Created on: 12/Nov/19 15:22 Start Date: 12/Nov/19 15:22 Worklog Time Spent: 10m Work Description: elek commented on pull request #144: HDDS-2461. Logging by ChunkUtils is misleading URL: https://github.com/apache/hadoop-ozone/pull/144 ## What changes were proposed in this pull request? During a k8s based test I found a lot of log message like: ``` 2019-11-12 14:27:13 WARN ChunkManagerImpl:209 - Duplicate write chunk request. Chunk overwrite without explicit request. ChunkInfo{chunkName='A9UrLxiEUN_testdata_chunk_4465025, offset=0, len=1024} ``` I was very surprised as at `ChunkManagerImpl:209` there was no related lines. It turned out that it's logged by `ChunkUtils` but it's used the logger of `ChunkManagerImpl`. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2461 ## How was this patch tested? I deployed a new version from Ozone to the kubernetes cluster. But I also added a new test method TestChunkUtil to have at least one unit test method which uses the logger. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 341966) Remaining Estimate: 0h Time Spent: 10m > Logging by ChunkUtils is misleading > --- > > Key: HDDS-2461 > URL: https://issues.apache.org/jira/browse/HDDS-2461 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > During a k8s based test I found a lot of log message like: > {code:java} > 2019-11-12 14:27:13 WARN ChunkManagerImpl:209 - Duplicate write chunk > request. Chunk overwrite without explicit request. > ChunkInfo{chunkName='A9UrLxiEUN_testdata_chunk_4465025, offset=0, len=1024} > {code} > I was very surprised as at ChunkManagerImpl:209 there was no similar lines. > It turned out that it's logged by ChunkUtils but it's used the logger of > ChunkManagerImpl. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2461) Logging by ChunkUtils is misleading
[ https://issues.apache.org/jira/browse/HDDS-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2461: - Labels: pull-request-available (was: ) > Logging by ChunkUtils is misleading > --- > > Key: HDDS-2461 > URL: https://issues.apache.org/jira/browse/HDDS-2461 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Labels: pull-request-available > > During a k8s based test I found a lot of log message like: > {code:java} > 2019-11-12 14:27:13 WARN ChunkManagerImpl:209 - Duplicate write chunk > request. Chunk overwrite without explicit request. > ChunkInfo{chunkName='A9UrLxiEUN_testdata_chunk_4465025, offset=0, len=1024} > {code} > I was very surprised as at ChunkManagerImpl:209 there was no similar lines. > It turned out that it's logged by ChunkUtils but it's used the logger of > ChunkManagerImpl. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2415) Completely disable tracer if hdds.tracing.enabled=false
[ https://issues.apache.org/jira/browse/HDDS-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated HDDS-2415: -- Fix Version/s: 0.5.0 Resolution: Fixed Status: Resolved (was: Patch Available) > Completely disable tracer if hdds.tracing.enabled=false > --- > > Key: HDDS-2415 > URL: https://issues.apache.org/jira/browse/HDDS-2415 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance, pull-request-available > Fix For: 0.5.0 > > Attachments: allocations.png > > Time Spent: 20m > Remaining Estimate: 0h > > There is a config setting to enable/disable OpenTracing-based distributed > tracing in Ozone ({{hdds.tracing.enabled}}). However, setting it to false > does not prevent tracer initialization, which causes unnecessary object > allocations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
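The gating described above can be sketched as a factory that only builds a real tracer when the flag is on and otherwise hands back a shared no-op instance, so disabled clusters pay no per-span allocation cost. This toy `Tracer` interface is a hypothetical stand-in for the OpenTracing API, not Ozone's actual TracingUtil:

```java
class TracerGate {
  // Minimal tracer interface standing in for io.opentracing.Tracer.
  interface Tracer {
    String startSpan(String name);
  }

  // One shared no-op instance: nothing is initialized or allocated per span.
  static final Tracer NOOP = name -> "noop";

  // Stand-in for the full tracer construction (reporters, samplers, etc.).
  static Tracer realTracer() {
    return name -> "span:" + name;
  }

  /** Only build the real tracer when hdds.tracing.enabled-style config
      is true; otherwise skip initialization entirely. */
  static Tracer create(boolean tracingEnabled) {
    return tracingEnabled ? realTracer() : NOOP;
  }

  public static void main(String[] args) {
    System.out.println(create(false).startSpan("write")); // noop
    System.out.println(create(true).startSpan("write"));  // span:write
  }
}
```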
[jira] [Work logged] (HDDS-2415) Completely disable tracer if hdds.tracing.enabled=false
[ https://issues.apache.org/jira/browse/HDDS-2415?focusedWorklogId=341985&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-341985 ] ASF GitHub Bot logged work on HDDS-2415: Author: ASF GitHub Bot Created on: 12/Nov/19 15:46 Start Date: 12/Nov/19 15:46 Worklog Time Spent: 10m Work Description: elek commented on pull request #128: HDDS-2415. Completely disable tracer if hdds.tracing.enabled=false URL: https://github.com/apache/hadoop-ozone/pull/128 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 341985) Time Spent: 20m (was: 10m) > Completely disable tracer if hdds.tracing.enabled=false > --- > > Key: HDDS-2415 > URL: https://issues.apache.org/jira/browse/HDDS-2415 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance, pull-request-available > Fix For: 0.5.0 > > Attachments: allocations.png > > Time Spent: 20m > Remaining Estimate: 0h > > There is a config setting to enable/disable OpenTracing-based distributed > tracing in Ozone ({{hdds.tracing.enabled}}). However, setting it to false > does not prevent tracer initialization, which causes unnecessary object > allocations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-1868) Ozone pipelines should be marked as ready only after the leader election is complete
[ https://issues.apache.org/jira/browse/HDDS-1868?focusedWorklogId=341992&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-341992 ] ASF GitHub Bot logged work on HDDS-1868: Author: ASF GitHub Bot Created on: 12/Nov/19 15:54 Start Date: 12/Nov/19 15:54 Worklog Time Spent: 10m Work Description: nandakumar131 commented on pull request #23: HDDS-1868. Ozone pipelines should be marked as ready only after the leader election is complete. URL: https://github.com/apache/hadoop-ozone/pull/23 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 341992) Time Spent: 3h 50m (was: 3h 40m) > Ozone pipelines should be marked as ready only after the leader election is > complete > > > Key: HDDS-1868 > URL: https://issues.apache.org/jira/browse/HDDS-1868 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode, SCM >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: HDDS-1868.01.patch, HDDS-1868.02.patch, > HDDS-1868.03.patch, HDDS-1868.04.patch, HDDS-1868.05.patch, HDDS-1868.06.patch > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Ozone pipeline on create and restart, start in allocated state. They are > moved into open state after all the pipeline have reported to it. However, > this potentially can lead into an issue where the pipeline is still not ready > to accept any incoming IO operations. > The pipelines should be marked as ready only after the leader election is > complete and leader is ready to accept incoming IO. 
[jira] [Updated] (HDDS-1868) Ozone pipelines should be marked as ready only after the leader election is complete
[ https://issues.apache.org/jira/browse/HDDS-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nanda kumar updated HDDS-1868: -- Resolution: Fixed Status: Resolved (was: Patch Available) > Ozone pipelines should be marked as ready only after the leader election is > complete > > > Key: HDDS-1868 > URL: https://issues.apache.org/jira/browse/HDDS-1868 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode, SCM >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: HDDS-1868.01.patch, HDDS-1868.02.patch, > HDDS-1868.03.patch, HDDS-1868.04.patch, HDDS-1868.05.patch, HDDS-1868.06.patch > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Ozone pipeline on create and restart, start in allocated state. They are > moved into open state after all the pipeline have reported to it. However, > this potentially can lead into an issue where the pipeline is still not ready > to accept any incoming IO operations. > The pipelines should be marked as ready only after the leader election is > complete and leader is ready to accept incoming IO. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2462) Add jq dependency in how to contribute docs
Istvan Fajth created HDDS-2462: -- Summary: Add jq dependency in how to contribute docs Key: HDDS-2462 URL: https://issues.apache.org/jira/browse/HDDS-2462 Project: Hadoop Distributed Data Store Issue Type: Improvement Reporter: Istvan Fajth Docker based tests are using JQ to parse JMX pages of different processes, but the documentation does not mention it as a dependency. Add it to CONTRIBUTION.MD in the "Additional requirements to execute different type of tests" section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972566#comment-16972566 ] Hadoop QA commented on HDFS-14648: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 32s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 10s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 54s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 10s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 52s{color} | {color:orange} hadoop-hdfs-project: The patch generated 1 new + 112 unchanged - 1 fixed = 113 total (was 113) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 45s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 11s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 13s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs-client generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 49s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 2s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 96m 54s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 41s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}171m 42s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs-project/hadoop-hdfs-client | | | org.apache.hadoop.hdfs.protocol.DatanodeInfo is incompatible with expected argument type String in org.apache.hadoop.hdfs.DeadNodeDetector.clearAndGetDetectedDeadNodes() At DeadNodeDetector.java:argument type String in org.apache.hadoop.hdfs.DeadNodeDetector.clearAndGetDetectedDeadNodes() At DeadNodeDetector.java:[line 165] | | Failed junit tests | hadoop.hdfs.server.namenode.TestReencryption | |
[jira] [Updated] (HDDS-2462) Add jq dependency in Contribution guideline
[ https://issues.apache.org/jira/browse/HDDS-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Istvan Fajth updated HDDS-2462: --- Summary: Add jq dependency in Contribution guideline (was: Add jq dependency in how to contribute docs) > Add jq dependency in Contribution guideline > --- > > Key: HDDS-2462 > URL: https://issues.apache.org/jira/browse/HDDS-2462 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Istvan Fajth >Priority: Major > > Docker based tests are using JQ to parse JMX pages of different processes, > but the documentation does not mention it as a dependency. > Add it to CONTRIBUTION.MD in the "Additional requirements to execute > different type of tests" section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972571#comment-16972571 ] Yiqun Lin commented on HDFS-14648: -- Thanks [~leosun08], the patch almost looks good now, only some minor comments: *DFSInputStream.java* 1. There is one additional place where we can add the addNodeToDeadNodeDetector call, in {{createBlockReader}}: {code:java} boolean createBlockReader(LocatedBlock block, long offsetInBlock, LocatedBlock[] targetBlocks, BlockReaderInfo[] readerInfos, ... } else { //TODO: handles connection issues DFSClient.LOG.warn("Failed to connect to " + dnInfo.addr + " for " + "block" + block.getBlock(), e); // re-fetch the block in case the block has been moved fetchBlockAt(block.getStartOffset()); addToLocalDeadNodes(dnInfo.info); // < } } {code} *DeadNodeDetector.java* 1. Can you address this comment that was missed? {quote}1. Can we comment the name as Client context name + /** + * Client context name. + */ + private String name; {quote} 2. We can use containsKey to check: {code:java} public boolean isDeadNode(DatanodeInfo datanodeInfo) { return deadNodes.containsKey(datanodeInfo.getDatanodeUuid()); } {code} Also, we can remove by key in method clearAndGetDetectedDeadNodes: {code:java} for (DatanodeInfo datanodeInfo : deadNodes.values()) { if (!newDeadNodes.contains(datanodeInfo)) { deadNodes.remove(datanodeInfo.getDatanodeUuid()); } } {code} 3. We can periodically call clearAndGetDetectedDeadNodes to keep the deadNodes list refreshed. I think the deadNodes list can become a little stale when the local dead node is cleared in the DFSInputStream. {code:java} public void run() { while (true) { clearAndGetDetectedDeadNodes(); LOG.debug("Current detector state {}, the detected nodes: {}.", state, deadNodes.values()); switch (state) { {code} 4. I don't fully get this. Why do we still call this in the latest patch? Can you explain? 
{noformat} newDeadNodes.retainAll(deadNodes.values()); {noformat} *TestDFSClientDetectDeadNodes.java* 1. Can you rename the unit test from {{TestDFSClientDetectDeadNodes}} to {{TestDeadNodeDetection}}? And simplify the comment to this: {noformat} +/** + * Tests for dead node detection in DFSClient. + */ +public class TestDeadNodeDetection { {noformat} Two other names updated: * testDetectDeadNodeInBackground --> testDeadNodeDetectionInBackground * testDeadNodeMultipleDFSInputStream --> testDeadNodeDetectionInMultipleDFSInputStream 2. No need to call {{ThreadUtil.sleepAtLeastIgnoreInterrupts(10 * 1000L);}} I think. 3. Can we extract the DFSClient here? I see we call getDFSClient() many times. {code:java} assertEquals(1, din1.getDFSClient().getDeadNodes(din1).size()); assertEquals(1, din1.getDFSClient().getClientContext() {code} > DeadNodeDetector basic model > > > Key: HDFS-14648 > URL: https://issues.apache.org/jira/browse/HDFS-14648 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, > HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, > HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch > > > This Jira constructs the DeadNodeDetector state machine model. The functions it > implements are as follows: > # When a DFSInputStream is opened, a BlockReader is opened. If some DataNode > of the block is found to be inaccessible, put the DataNode into > DeadNodeDetector#deadnode (HDFS-14649 will optimize this part). Because when the > DataNode is not accessible, it is likely that the replica has been removed > from the DataNode. Therefore, it needs to be confirmed by re-probing and > requires higher priority processing. > # DeadNodeDetector will periodically probe each Node in > DeadNodeDetector#deadnode. If the access is successful, the Node will be > removed from DeadNodeDetector#deadnode. Continuous detection of the dead node > is necessary. The DataNode may need to rejoin the cluster after a service > restart/machine repair; without this added probe mechanism the DataNode would be > permanently excluded. > # DeadNodeDetector#dfsInputStreamNodes records the DataNodes used by each > DFSInputStream. When the DFSInputStream is closed, its entries are removed from > DeadNodeDetector#dfsInputStreamNodes. > # Every time the global deadnode list is fetched, update DeadNodeDetector#deadnode. > The new DeadNodeDetector#deadnode equals the intersection of the old > DeadNodeDetector#deadnode and the DataNodes referenced by > DeadNodeDetector#dfsInputStreamNodes. > # DeadNodeDetector has a switch that is turned off by default. When it is
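The bookkeeping suggested in the review above (keying {{deadNodes}} by datanode UUID, membership via {{containsKey}}, pruning stale entries by key, and taking the intersection with the still-referenced nodes) can be sketched as follows. {{DeadNodeSketch}} and its String-based node handles are simplified stand-ins for illustration, not the actual HDFS types:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the DeadNodeDetector bookkeeping discussed above:
// maps datanode UUID -> human-readable node description.
public class DeadNodeSketch {
    private final Map<String, String> deadNodes = new ConcurrentHashMap<>();

    public void addDeadNode(String uuid, String description) {
        deadNodes.put(uuid, description);
    }

    // Membership check via containsKey on the UUID key.
    public boolean isDeadNode(String uuid) {
        return deadNodes.containsKey(uuid);
    }

    // Keep only nodes still referenced by some open input stream: the new
    // dead set is the intersection of the old dead set and the currently
    // referenced nodes. Removal is by key, not by value.
    public Set<String> clearAndGetDetectedDeadNodes(Set<String> referencedUuids) {
        deadNodes.keySet().removeIf(uuid -> !referencedUuids.contains(uuid));
        return new HashSet<>(deadNodes.keySet());
    }
}
```

Removing via {{keySet().removeIf}} sidesteps the type-mismatch trap of calling {{remove}} with the value object on a map keyed by UUID strings.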
[jira] [Updated] (HDDS-2462) Add jq dependency in Contribution guideline
[ https://issues.apache.org/jira/browse/HDDS-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2462: - Labels: pull-request-available (was: ) > Add jq dependency in Contribution guideline > --- > > Key: HDDS-2462 > URL: https://issues.apache.org/jira/browse/HDDS-2462 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Istvan Fajth >Priority: Major > Labels: pull-request-available > > Docker based tests are using JQ to parse JMX pages of different processes, > but the documentation does not mention it as a dependency. > Add it to CONTRIBUTION.MD in the "Additional requirements to execute > different type of tests" section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2462) Add jq dependency in Contribution guideline
[ https://issues.apache.org/jira/browse/HDDS-2462?focusedWorklogId=342001&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-342001 ] ASF GitHub Bot logged work on HDDS-2462: Author: ASF GitHub Bot Created on: 12/Nov/19 16:08 Start Date: 12/Nov/19 16:08 Worklog Time Spent: 10m Work Description: fapifta commented on pull request #145: HDDS-2462. Add jq dependency in Contribution guideline URL: https://github.com/apache/hadoop-ozone/pull/145 ## What changes were proposed in this pull request? Documentation update, add jq dependency into the Contribution Guideline in the "Additional requirements to execute different type of tests" section ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2462 ## How was this patch tested? Doc change, no tests needed as far as I can tell. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 342001) Remaining Estimate: 0h Time Spent: 10m > Add jq dependency in Contribution guideline > --- > > Key: HDDS-2462 > URL: https://issues.apache.org/jira/browse/HDDS-2462 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Istvan Fajth >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Docker based tests are using JQ to parse JMX pages of different processes, > but the documentation does not mention it as a dependency. > Add it to CONTRIBUTION.MD in the "Additional requirements to execute > different type of tests" section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDDS-2462) Add jq dependency in Contribution guideline
[ https://issues.apache.org/jira/browse/HDDS-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Istvan Fajth reassigned HDDS-2462: -- Assignee: Istvan Fajth > Add jq dependency in Contribution guideline > --- > > Key: HDDS-2462 > URL: https://issues.apache.org/jira/browse/HDDS-2462 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Istvan Fajth >Assignee: Istvan Fajth >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Docker based tests are using JQ to parse JMX pages of different processes, > but the documentation does not mention it as a dependency. > Add it to CONTRIBUTION.MD in the "Additional requirements to execute > different type of tests" section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972577#comment-16972577 ] Hadoop QA commented on HDFS-14442: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 47s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 18s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 20s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 51s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 22 unchanged - 0 fixed = 24 total (was 22) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 38s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}112m 30s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}185m 14s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics | | | hadoop.hdfs.server.datanode.TestBlockHasMultipleReplicasOnSameDN | | | hadoop.hdfs.server.datanode.TestDataNodeReconfiguration | | | hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits | | | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped | | | hadoop.hdfs.server.datanode.checker.TestDatasetVolumeCheckerTimeout | | | hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean | | | hadoop.hdfs.server.mover.TestMover | | | hadoop.hdfs.server.mover.TestStorageMover | | | hadoop.hdfs.server.datanode.TestDataNodeLifeline | | | hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped | | | hadoop.hdfs.server.blockmanagement.TestBlockReportRateLimiting | | | hadoop.hdfs.server.datanode.TestBlockRecovery | | | hadoop.hdfs.TestRollingUpgrade | | | hadoop.hdfs.server.blockmanagement.TestPendingReconstruction | | | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy | | | hadoop.hdfs.server.datanode.TestBatchIbr | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14442 | | JIRA Patch U
[jira] [Updated] (HDDS-2456) Add explicit base image version for images derived from ozone-runner
[ https://issues.apache.org/jira/browse/HDDS-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated HDDS-2456: -- Fix Version/s: 0.5.0 Resolution: Fixed Status: Resolved (was: Patch Available) > Add explicit base image version for images derived from ozone-runner > > > Key: HDDS-2456 > URL: https://issues.apache.org/jira/browse/HDDS-2456 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: docker >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 10m > Remaining Estimate: 0h > > {{ozone-om-ha}} and {{ozonescripts}} build images based on > {{apache/ozone-runner}}. > Problem: They do not specify base image versions, so it defaults to > {{latest}}. If a new {{ozone-runner}} image is published on Docker Hub, > developers needs to manually pull the {{latest}} image for it to take effect > on these derived images. > Solution: Use explicit base image version (defined by > {{OZONE_RUNNER_VERSION}} variable in {{.env}} file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2456) Add explicit base image version for images derived from ozone-runner
[ https://issues.apache.org/jira/browse/HDDS-2456?focusedWorklogId=342014&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-342014 ] ASF GitHub Bot logged work on HDDS-2456: Author: ASF GitHub Bot Created on: 12/Nov/19 16:23 Start Date: 12/Nov/19 16:23 Worklog Time Spent: 10m Work Description: elek commented on pull request #139: HDDS-2456. Add explicit base image version for images derived from ozone-runner URL: https://github.com/apache/hadoop-ozone/pull/139 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 342014) Time Spent: 20m (was: 10m) > Add explicit base image version for images derived from ozone-runner > > > Key: HDDS-2456 > URL: https://issues.apache.org/jira/browse/HDDS-2456 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: docker >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 20m > Remaining Estimate: 0h > > {{ozone-om-ha}} and {{ozonescripts}} build images based on > {{apache/ozone-runner}}. > Problem: They do not specify base image versions, so it defaults to > {{latest}}. If a new {{ozone-runner}} image is published on Docker Hub, > developers needs to manually pull the {{latest}} image for it to take effect > on these derived images. > Solution: Use explicit base image version (defined by > {{OZONE_RUNNER_VERSION}} variable in {{.env}} file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14528) Failover from Active to Standby Failed
[ https://issues.apache.org/jira/browse/HDFS-14528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972622#comment-16972622 ] Hadoop QA commented on HDFS-14528: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 6s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 55s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 56s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 18m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 18m 28s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 57s{color} | {color:orange} root: The patch generated 11 new + 36 unchanged - 0 fixed = 47 total (was 36) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 33s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 8s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 46s{color} | {color:green} hadoop-common in the patch passed. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 22s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 57s{color} | {color:red} The patch generated 1 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}214m 58s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier | | | hadoop.hdfs.server.balancer.TestBalancerRPCDelay | | | hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy | | | hadoop.hdfs.TestFileChecksumCompositeCrc | | | hadoop.hdfs.TestErasureCodingPolicies | | | hadoop.hdfs.TestDecommissionWithStriped | | | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.TestFileAppend2 | | | hadoop.hdfs.TestReadStripedFileWithMissingBlocks | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14528 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985622/HDFS-14528.006.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite
[jira] [Updated] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14648: --- Attachment: HDFS-14648.009.patch > DeadNodeDetector basic model > > > Key: HDFS-14648 > URL: https://issues.apache.org/jira/browse/HDFS-14648 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, > HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, > HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, > HDFS-14648.009.patch > > > This Jira constructs the DeadNodeDetector state machine model. It implements the following functionality: > # When a DFSInputStream is opened, a BlockReader is opened. If a DataNode holding the block is found to be inaccessible, that DataNode is put into DeadNodeDetector#deadnode (HDFS-14649 will optimize this part). When a DataNode is not accessible, it is likely that the replica has been removed from it, so this needs to be confirmed by re-probing, with higher-priority processing. > # DeadNodeDetector periodically probes the nodes in DeadNodeDetector#deadnode; if a probe succeeds, the node is removed from DeadNodeDetector#deadnode. Continuous probing of dead nodes is necessary because a DataNode may rejoin the cluster after a service restart or machine repair; without such a probe mechanism it would be permanently excluded. > # DeadNodeDetector#dfsInputStreamNodes records the DataNodes used by each DFSInputStream. When a DFSInputStream is closed, its entries are removed from DeadNodeDetector#dfsInputStreamNodes. > # Every time the global dead-node set is fetched, DeadNodeDetector#deadnode is updated: the new DeadNodeDetector#deadnode equals the intersection of the old DeadNodeDetector#deadnode and the DataNodes referenced by DeadNodeDetector#dfsInputStreamNodes. > # DeadNodeDetector has a switch that is off by default. When it is off, each DFSInputStream still uses its own local dead-node set. > # This feature has been used in the XIAOMI production environment for a long time and has reduced HBase read stalls caused by node hangs. > # Just turn on the DeadNodeDetector switch to use it; there are no other restrictions. If you don't want to use DeadNodeDetector, simply turn it off. > {code:java} > if (sharedDeadNodesEnabled && deadNodeDetector == null) { > deadNodeDetector = new DeadNodeDetector(name); > deadNodeDetectorThr = new Daemon(deadNodeDetector); > deadNodeDetectorThr.start(); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
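The lifecycle above (a shared dead set, periodic re-probing, per-stream bookkeeping, and the intersection rule on update) can be sketched as follows. The class and method names here are hypothetical simplifications for illustration, not the actual HDFS-14648 implementation:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Minimal sketch (hypothetical names): streams report unreachable nodes into
// one shared set, and a periodic probe removes nodes that answer again.
class SharedDeadNodeTracker {
    private final Set<String> deadNodes = ConcurrentHashMap.newKeySet();
    // stream id -> nodes that stream is reading from
    private final Map<String, Set<String>> streamNodes = new ConcurrentHashMap<>();

    // A stream found this node unreachable: record it in the shared dead set.
    void reportDead(String streamId, String node) {
        streamNodes.computeIfAbsent(streamId, k -> ConcurrentHashMap.newKeySet())
                .add(node);
        deadNodes.add(node);
    }

    // Periodic probe: a node that responds again leaves the dead set, so a
    // restarted or repaired DataNode is not excluded forever.
    void probe(Predicate<String> isAlive) {
        deadNodes.removeIf(isAlive);
    }

    // On stream close, drop its node references and shrink the dead set to
    // nodes still referenced by some open stream (the intersection rule).
    void closeStream(String streamId) {
        streamNodes.remove(streamId);
        deadNodes.removeIf(
                n -> streamNodes.values().stream().noneMatch(s -> s.contains(n)));
    }

    Set<String> getDeadNodes() {
        return deadNodes;
    }
}
```

The real detector additionally runs the probe in a background Daemon thread (as the enabling snippet above shows) and is guarded by the default-off switch.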
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972624#comment-16972624 ] Lisheng Sun commented on HDFS-14648: Thanks [~linyiqun] for the good comments. I updated the patch per your comments and uploaded the v009 patch. Thank you. > DeadNodeDetector basic model > > > Key: HDFS-14648 > URL: https://issues.apache.org/jira/browse/HDFS-14648 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, > HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, > HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, > HDFS-14648.009.patch > > > This Jira constructs the DeadNodeDetector state machine model. It implements the following functionality: > # When a DFSInputStream is opened, a BlockReader is opened. If a DataNode holding the block is found to be inaccessible, that DataNode is put into DeadNodeDetector#deadnode (HDFS-14649 will optimize this part). When a DataNode is not accessible, it is likely that the replica has been removed from it, so this needs to be confirmed by re-probing, with higher-priority processing. > # DeadNodeDetector periodically probes the nodes in DeadNodeDetector#deadnode; if a probe succeeds, the node is removed from DeadNodeDetector#deadnode. Continuous probing of dead nodes is necessary because a DataNode may rejoin the cluster after a service restart or machine repair; without such a probe mechanism it would be permanently excluded. > # DeadNodeDetector#dfsInputStreamNodes records the DataNodes used by each DFSInputStream. When a DFSInputStream is closed, its entries are removed from DeadNodeDetector#dfsInputStreamNodes. > # Every time the global dead-node set is fetched, DeadNodeDetector#deadnode is updated: the new DeadNodeDetector#deadnode equals the intersection of the old DeadNodeDetector#deadnode and the DataNodes referenced by DeadNodeDetector#dfsInputStreamNodes. > # DeadNodeDetector has a switch that is off by default. When it is off, each DFSInputStream still uses its own local dead-node set. > # This feature has been used in the XIAOMI production environment for a long time and has reduced HBase read stalls caused by node hangs. > # Just turn on the DeadNodeDetector switch to use it; there are no other restrictions. If you don't want to use DeadNodeDetector, simply turn it off. > {code:java} > if (sharedDeadNodesEnabled && deadNodeDetector == null) { > deadNodeDetector = new DeadNodeDetector(name); > deadNodeDetectorThr = new Daemon(deadNodeDetector); > deadNodeDetectorThr.start(); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-14981) BlockStateChange logging is exceedingly verbose
Nick Dimiduk created HDFS-14981: --- Summary: BlockStateChange logging is exceedingly verbose Key: HDFS-14981 URL: https://issues.apache.org/jira/browse/HDFS-14981 Project: Hadoop HDFS Issue Type: Bug Components: logging Reporter: Nick Dimiduk On a moderately loaded cluster, name node logs are flooded with entries of {{INFO BlockStateChange...}}, to the tune of ~30 lines per millisecond. This provides operators with little to no usable information. I suggest reducing this log message to {{DEBUG}}. Perhaps this information (and other logging related to it) should be directed to a dedicated block-audit.log file that can be queried, rotated on a separate schedule from the log of the main process. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
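Both suggestions above can be expressed in the NameNode's log4j.properties; the {{BlockStateChange}} logger name is the one the NameNode uses for these messages, while the appender name, file path, and rotation sizes below are illustrative. A sketch:

```properties
# Option 1: simply quiet these messages in the main NameNode log.
# log4j.logger.BlockStateChange=WARN

# Option 2 (sketch): route BlockStateChange to its own file so it can be
# queried and rotated independently of the main process log.
log4j.logger.BlockStateChange=INFO,blockstate
log4j.additivity.BlockStateChange=false
log4j.appender.blockstate=org.apache.log4j.RollingFileAppender
log4j.appender.blockstate.File=${hadoop.log.dir}/block-audit.log
log4j.appender.blockstate.MaxFileSize=256MB
log4j.appender.blockstate.MaxBackupIndex=20
log4j.appender.blockstate.layout=org.apache.log4j.PatternLayout
log4j.appender.blockstate.layout.ConversionPattern=%d{ISO8601} %p %m%n
```

Setting {{additivity}} to false keeps the routed messages out of the main log entirely.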
[jira] [Commented] (HDFS-14981) BlockStateChange logging is exceedingly verbose
[ https://issues.apache.org/jira/browse/HDFS-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972640#comment-16972640 ] Wei-Chiu Chuang commented on HDFS-14981: I think this is done by HDFS-6860. > BlockStateChange logging is exceedingly verbose > --- > > Key: HDFS-14981 > URL: https://issues.apache.org/jira/browse/HDFS-14981 > Project: Hadoop HDFS > Issue Type: Bug > Components: logging >Reporter: Nick Dimiduk >Priority: Major > > On a moderately loaded cluster, name node logs are flooded with entries of > {{INFO BlockStateChange...}}, to the tune of ~30 lines per millisecond. This > provides operators with little to no usable information. I suggest reducing > this log message to {{DEBUG}}. Perhaps this information (and other logging > related to it) should be directed to a dedicated block-audit.log file that > can be queried, rotated on a separate schedule from the log of the main > process. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14922) On StartUp , Snapshot modification time got changed
[ https://issues.apache.org/jira/browse/HDFS-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972642#comment-16972642 ] hemanthboyina commented on HDFS-14922: -- [~elgoiri] can you push the patch forward > On StartUp , Snapshot modification time got changed > --- > > Key: HDFS-14922 > URL: https://issues.apache.org/jira/browse/HDFS-14922 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14922.001.patch, HDFS-14922.002.patch, > HDFS-14922.003.patch, HDFS-14922.004.patch, HDFS-14922.005.patch > > > Snapshot modification time got changed on namenode restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14528) Failover from Active to Standby Failed
[ https://issues.apache.org/jira/browse/HDFS-14528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravuri Sushma sree updated HDFS-14528: -- Attachment: (was: HDFS-14528.006.patch) > Failover from Active to Standby Failed > > > Key: HDFS-14528 > URL: https://issues.apache.org/jira/browse/HDFS-14528 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Ravuri Sushma sree >Assignee: Ravuri Sushma sree >Priority: Major > Labels: multi-sbnn > Attachments: HDFS-14528.003.patch, HDFS-14528.004.patch, > HDFS-14528.005.patch, HDFS-14528.2.Patch, ZKFC_issue.patch > > > *In a cluster with more than one Standby namenode, manual failover throws an > exception in some cases.* > *When trying to execute the failover command from active to standby,* > *_./hdfs haadmin -failover nn1 nn2_, *the below exception is thrown:* > Operation failed: Call From X-X-X-X/X-X-X-X to Y-Y-Y-Y: failed on > connection exception: java.net.ConnectException: Connection refused > This is encountered in the following cases: > Scenario 1: > Namenodes - NN1(Active), NN2(Standby), NN3(Standby) > When trying to manually failover from NN1 to NN2, if NN3 is down an exception is > thrown. > Scenario 2: > Namenodes - NN1(Active), NN2(Standby), NN3(Standby) > ZKFCs - ZKFC1, ZKFC2, ZKFC3 > When trying to manually failover from NN1 to NN3, if NN3's ZKFC (ZKFC3) is > down an exception is thrown. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-14981) BlockStateChange logging is exceedingly verbose
[ https://issues.apache.org/jira/browse/HDFS-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Dimiduk resolved HDFS-14981. - Resolution: Duplicate Yep, I think you're right. Thanks for the pointer [~weichiu]. > BlockStateChange logging is exceedingly verbose > --- > > Key: HDFS-14981 > URL: https://issues.apache.org/jira/browse/HDFS-14981 > Project: Hadoop HDFS > Issue Type: Bug > Components: logging >Reporter: Nick Dimiduk >Priority: Major > > On a moderately loaded cluster, name node logs are flooded with entries of > {{INFO BlockStateChange...}}, to the tune of ~30 lines per millisecond. This > provides operators with little to no usable information. I suggest reducing > this log message to {{DEBUG}}. Perhaps this information (and other logging > related to it) should be directed to a dedicated block-audit.log file that > can be queried, rotated on a separate schedule from the log of the main > process. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14922) On StartUp, snapshot modification time got changed
[ https://issues.apache.org/jira/browse/HDFS-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-14922: --- Summary: On StartUp, snapshot modification time got changed (was: On StartUp , Snapshot modification time got changed) > On StartUp, snapshot modification time got changed > -- > > Key: HDFS-14922 > URL: https://issues.apache.org/jira/browse/HDFS-14922 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14922.001.patch, HDFS-14922.002.patch, > HDFS-14922.003.patch, HDFS-14922.004.patch, HDFS-14922.005.patch > > > Snapshot modification time got changed on namenode restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14922) Prevent snapshot modification time got change on startup
[ https://issues.apache.org/jira/browse/HDFS-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-14922: --- Summary: Prevent snapshot modification time got change on startup (was: On StartUp, snapshot modification time got changed) > Prevent snapshot modification time got change on startup > > > Key: HDFS-14922 > URL: https://issues.apache.org/jira/browse/HDFS-14922 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14922.001.patch, HDFS-14922.002.patch, > HDFS-14922.003.patch, HDFS-14922.004.patch, HDFS-14922.005.patch > > > Snapshot modification time got changed on namenode restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14922) Prevent snapshot modification time got change on startup
[ https://issues.apache.org/jira/browse/HDFS-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated HDFS-14922: --- Fix Version/s: 3.3.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) > Prevent snapshot modification time got change on startup > > > Key: HDFS-14922 > URL: https://issues.apache.org/jira/browse/HDFS-14922 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14922.001.patch, HDFS-14922.002.patch, > HDFS-14922.003.patch, HDFS-14922.004.patch, HDFS-14922.005.patch > > > Snapshot modification time got changed on namenode restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14922) Prevent snapshot modification time got change on startup
[ https://issues.apache.org/jira/browse/HDFS-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972694#comment-16972694 ] Íñigo Goiri commented on HDFS-14922: Thanks [~hemanthboyina] for the patch and [~virajith] for checking. Committed to trunk. > Prevent snapshot modification time got change on startup > > > Key: HDFS-14922 > URL: https://issues.apache.org/jira/browse/HDFS-14922 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14922.001.patch, HDFS-14922.002.patch, > HDFS-14922.003.patch, HDFS-14922.004.patch, HDFS-14922.005.patch > > > Snapshot modification time got changed on namenode restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14528) Failover from Active to Standby Failed
[ https://issues.apache.org/jira/browse/HDFS-14528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravuri Sushma sree updated HDFS-14528: -- Attachment: HDFS-14528.006.patch > Failover from Active to Standby Failed > > > Key: HDFS-14528 > URL: https://issues.apache.org/jira/browse/HDFS-14528 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Reporter: Ravuri Sushma sree >Assignee: Ravuri Sushma sree >Priority: Major > Labels: multi-sbnn > Attachments: HDFS-14528.003.patch, HDFS-14528.004.patch, > HDFS-14528.005.patch, HDFS-14528.006.patch, HDFS-14528.2.Patch, > ZKFC_issue.patch > > > *In a cluster with more than one Standby namenode, manual failover throws an > exception in some cases.* > *When trying to execute the failover command from active to standby,* > *_./hdfs haadmin -failover nn1 nn2_, *the below exception is thrown:* > Operation failed: Call From X-X-X-X/X-X-X-X to Y-Y-Y-Y: failed on > connection exception: java.net.ConnectException: Connection refused > This is encountered in the following cases: > Scenario 1: > Namenodes - NN1(Active), NN2(Standby), NN3(Standby) > When trying to manually failover from NN1 to NN2, if NN3 is down an exception is > thrown. > Scenario 2: > Namenodes - NN1(Active), NN2(Standby), NN3(Standby) > ZKFCs - ZKFC1, ZKFC2, ZKFC3 > When trying to manually failover from NN1 to NN3, if NN3's ZKFC (ZKFC3) is > down an exception is thrown. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14655) [SBN Read] Namenode crashes if one of The JN is down
[ https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972701#comment-16972701 ] Chen Liang commented on HDFS-14655: --- We have this fix in our deployment. One thing I found is that it prints a ton of WARN {{java.util.concurrent.CancellationException}} entries in the NN logs. Can we make a fix to suppress these warnings? > [SBN Read] Namenode crashes if one of The JN is down > > > Key: HDFS-14655 > URL: https://issues.apache.org/jira/browse/HDFS-14655 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Harshakiran Reddy >Assignee: Ayush Saxena >Priority: Critical > Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14655-01.patch, HDFS-14655-02.patch, > HDFS-14655-03.patch, HDFS-14655-04.patch, HDFS-14655-05.patch, > HDFS-14655-06.patch, HDFS-14655-07.patch, HDFS-14655-08.patch, > HDFS-14655-branch-2-01.patch, HDFS-14655-branch-2-02.patch, > HDFS-14655.poc.patch > > > {noformat} > 2019-07-04 17:35:54,064 | INFO | Logger channel (from parallel executor) to > XXX/XXX | Retrying connect to server: XXX/XXX. Already tried > 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) | Client.java:975 > 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered > while tailing edits. Shutting down standby NN. 
| EditLogTailer.java:474 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at > com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440) > at > com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56) > at > org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565) > at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:360) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) > at > 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) > 2019-07-04 17:35:54,112 | INFO | Edit log tailer | Exiting with status 1: > java.lang.OutOfMemoryError: unable to create new native thread | > ExitUtil.java:210 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
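The {{OutOfMemoryError}} in the stack trace above comes from unbounded thread creation when calls to a down JournalNode back up. As an illustration of the general mitigation only (this is not the actual HDFS-14655 patch, and the class name is hypothetical), a pool with a fixed thread cap, a bounded queue, and a caller-runs rejection policy turns thread exhaustion into back-pressure:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a capped pool cannot fail with "unable to create new
// native thread" no matter how many tasks are submitted; once the bounded
// queue fills, excess work runs on the submitting thread instead.
class BoundedLoggerExecutor {
    static int runTasks(int nTasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2,                  // hard cap: at most 2 worker threads
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(8),                // bounded queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < nTasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

Every submitted task still completes; the bound only limits concurrency, which is why this pattern avoids the crash without dropping work.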
[jira] [Commented] (HDFS-14955) RBF: getQuotaUsage() on mount point should return global quota.
[ https://issues.apache.org/jira/browse/HDFS-14955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972704#comment-16972704 ] Íñigo Goiri commented on HDFS-14955: Thanks [~LiJinglun] for the patch. * Update the javadoc for {{aggregateQuota()}}. * I think we can skip most of the for loop right before if this is a mount point. > RBF: getQuotaUsage() on mount point should return global quota. > --- > > Key: HDFS-14955 > URL: https://issues.apache.org/jira/browse/HDFS-14955 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Minor > Attachments: HDFS-14955.001.patch > > > When getQuotaUsage() on a mount point path, the quota part should be the > global quota. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14922) Prevent snapshot modification time got change on startup
[ https://issues.apache.org/jira/browse/HDFS-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972707#comment-16972707 ] Hudson commented on HDFS-14922: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17634 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17634/]) HDFS-14922. Prevent snapshot modification time got change on startup. (inigoiri: rev 40150da1e12a41c2e774fe2a277ddc3988bed239) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeDirectory.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/SnapshotManager.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirSnapshotOp.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestSnapshot.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectorySnapshottableFeature.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestSnapshotManager.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java > Prevent snapshot modification time got change on startup > > > Key: HDFS-14922 > URL: https://issues.apache.org/jira/browse/HDFS-14922 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14922.001.patch, HDFS-14922.002.patch, > HDFS-14922.003.patch, HDFS-14922.004.patch, HDFS-14922.005.patch > > > Snapshot modification time got changed on namenode restart -- This 
message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravuri Sushma sree updated HDFS-14442: -- Attachment: (was: HDFS-14442.003.patch) > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, > HDFS-14442.003.PATCH > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. >* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. >*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. 
> fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravuri Sushma sree updated HDFS-14442: -- Attachment: HDFS-14442.003.PATCH > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, > HDFS-14442.003.PATCH > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. >* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. >*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. 
> fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
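The assumption flagged above can be illustrated outside Hadoop with a plain {{java.lang.reflect.Proxy}}. The sketch below (all names hypothetical, not the real HDFS types) shows an invocation handler that, like {{ObserverReadProxyProvider}}, may point at different backends over time, so the "current target" seen through the handler is not necessarily the active node:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class ProxyTargetDemo {

    // Hypothetical stand-in for a NameNode RPC interface.
    public interface NameNodeProtocol {
        String whoAmI();
    }

    // Like ObserverReadProxyProvider, this handler may point at different
    // backends over time; "the current target" is whatever it last selected,
    // not necessarily the active node.
    public static class SwitchingHandler implements InvocationHandler {
        public volatile NameNodeProtocol current;

        public SwitchingHandler(NameNodeProtocol initial) {
            this.current = initial;
        }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args)
                throws Throwable {
            return method.invoke(current, args);
        }
    }

    public static NameNodeProtocol wrap(SwitchingHandler handler) {
        return (NameNodeProtocol) Proxy.newProxyInstance(
                ProxyTargetDemo.class.getClassLoader(),
                new Class<?>[] {NameNodeProtocol.class},
                handler);
    }

    public static void main(String[] args) {
        NameNodeProtocol active = () -> "active";
        NameNodeProtocol observer = () -> "observer";
        SwitchingHandler handler = new SwitchingHandler(observer);
        NameNodeProtocol proxy = wrap(handler);

        // Asking the proxy "which server are you connected to?" reports the
        // observer even though an active node exists -- the same unstated
        // assumption HAUtil.getAddressOfActive makes about getConnectionId().
        System.out.println(proxy.whoAmI()); // prints "observer"
        handler.current = active;
        System.out.println(proxy.whoAmI()); // prints "active"
    }
}
```

A caller that treats the handler's current target as "the active server" gets a stale or wrong answer the moment the handler switches, which is exactly the disagreement this issue describes.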
[jira] [Commented] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972714#comment-16972714 ] Hadoop QA commented on HDFS-14442: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 5s{color} | {color:blue} The patch file was not named according to hadoop's naming conventions. Please see https://wiki.apache.org/hadoop/HowToContribute for instructions. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 8s{color} | {color:red} HDFS-14442 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | HDFS-14442 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985651/HDFS-14442.003.PATCH | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28299/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, > HDFS-14442.003.PATCH > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. 
>* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. >*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. > fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravuri Sushma sree updated HDFS-14442: -- Attachment: (was: HDFS-14442.003.PATCH) > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. >* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. >*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. 
> fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14959) [SBNN read] access time should be turned off
[ https://issues.apache.org/jira/browse/HDFS-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-14959: --- Fix Version/s: 3.2.2 3.1.4 3.3.0 > [SBNN read] access time should be turned off > > > Key: HDFS-14959 > URL: https://issues.apache.org/jira/browse/HDFS-14959 > Project: Hadoop HDFS > Issue Type: Task > Components: documentation >Reporter: Wei-Chiu Chuang >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > > Both Uber and Didi shared that access time has to be switched off to avoid > spiky NameNode RPC process time. If access time is not off entirely, > getBlockLocations RPCs have to update access time and must access the active > NameNode. (that's my understanding. haven't checked the code) > We should record this as a best practice in our doc. > (If you are on the ASF slack, check out this thread > https://the-asf.slack.com/archives/CAD7C52Q3/p1572033324008600) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
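For reference, the switch the issue recommends maps to a single {{hdfs-site.xml}} property: {{dfs.namenode.accesstime.precision}} controls how often access times are persisted (default one hour, 3600000 ms), and setting it to 0 disables access-time updates entirely. A sketch of the change:

```xml
<!-- hdfs-site.xml: a precision of 0 disables access-time updates, so
     getBlockLocations calls need not write to the active NameNode.
     The default is 3600000 (one hour). -->
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>0</value>
</property>
```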
[jira] [Resolved] (HDFS-14959) [SBNN read] access time should be turned off
[ https://issues.apache.org/jira/browse/HDFS-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang resolved HDFS-14959. Resolution: Fixed Merged the PR to trunk and cherry-picked the commit to branch-3.2 and branch-3.1. Thanks [~csun]! > [SBNN read] access time should be turned off > > > Key: HDFS-14959 > URL: https://issues.apache.org/jira/browse/HDFS-14959 > Project: Hadoop HDFS > Issue Type: Task > Components: documentation >Reporter: Wei-Chiu Chuang >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > > Both Uber and Didi shared that access time has to be switched off to avoid > spiky NameNode RPC process time. If access time is not off entirely, > getBlockLocations RPCs have to update access time and must access the active > NameNode. (that's my understanding. haven't checked the code) > We should record this as a best practice in our doc. > (If you are on the ASF slack, check out this thread > https://the-asf.slack.com/archives/CAD7C52Q3/p1572033324008600) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972735#comment-16972735 ] Hadoop QA commented on HDFS-14648: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 47s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 55s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 35s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 40s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 51s{color} | {color:green} hadoop-hdfs-project: The patch generated 0 new + 112 unchanged - 1 fixed = 112 total (was 113) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 40s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 24s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 39s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 51s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}106m 28s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}182m 49s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | | hadoop.hdfs.tools.TestDFSZKFailoverController | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14648 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985638/HDFS-14648.009.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux 40af05af70d8 4.15.0-6
[jira] [Commented] (HDFS-14959) [SBNN read] access time should be turned off
[ https://issues.apache.org/jira/browse/HDFS-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972736#comment-16972736 ] Hudson commented on HDFS-14959: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17636 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17636/]) HDFS-14959: [SBNN read] access time should be turned off (#1706) (weichiu: rev 97ec34e117af71e1a9950b8002131c45754009c7) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ObserverNameNode.md > [SBNN read] access time should be turned off > > > Key: HDFS-14959 > URL: https://issues.apache.org/jira/browse/HDFS-14959 > Project: Hadoop HDFS > Issue Type: Task > Components: documentation >Reporter: Wei-Chiu Chuang >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > > Both Uber and Didi shared that access time has to be switched off to avoid > spiky NameNode RPC process time. If access time is not off entirely, > getBlockLocations RPCs have to update access time and must access the active > NameNode. (that's my understanding. haven't checked the code) > We should record this as a best practice in our doc. > (If you are on the ASF slack, check out this thread > https://the-asf.slack.com/archives/CAD7C52Q3/p1572033324008600) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2463) Remove unnecessary getServiceInfo calls
Xiaoyu Yao created HDDS-2463: Summary: Remove unnecessary getServiceInfo calls Key: HDDS-2463 URL: https://issues.apache.org/jira/browse/HDDS-2463 Project: Hadoop Distributed Data Store Issue Type: Bug Affects Versions: 0.4.1 Reporter: Xiaoyu Yao Assignee: Xiaoyu Yao OzoneManagerProtocolClientSideTranslatorPB.java Lines 766-772 have multiple impl.getServiceInfo() calls, which can be reduced by adding a local variable. {code:java} resp.addAllServiceInfo(impl.getServiceInfo().getServiceInfoList().stream() .map(ServiceInfo::getProtobuf) .collect(Collectors.toList())); if (impl.getServiceInfo().getCaCertificate() != null) { resp.setCaCertificate(impl.getServiceInfo().getCaCertificate()); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
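The proposed change is simply hoisting the repeated lookup into a local variable. A minimal sketch of the shape of the fix, using hypothetical stand-in classes (the real OM types are not reproduced here), with a counter to make the reduction visible:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: these stubs mimic the shape of the OM code so the
// effect of hoisting the call into a local variable can be demonstrated.
public class GetServiceInfoRefactor {

    public static class ServiceInfoStub {
        List<String> getServiceInfoList() { return Arrays.asList("om", "scm"); }
        String getCaCertificate() { return "ca-cert"; }
    }

    public static class ImplStub {
        final AtomicInteger lookups = new AtomicInteger();

        ServiceInfoStub getServiceInfo() {
            lookups.incrementAndGet(); // pretend this is the expensive call
            return new ServiceInfoStub();
        }
    }

    // After the refactor: one getServiceInfo() call, reused for both the
    // service list and the CA certificate check.
    public static int buildResponse(ImplStub impl) {
        ServiceInfoStub info = impl.getServiceInfo(); // single lookup
        List<String> services = info.getServiceInfoList();
        String caCert = info.getCaCertificate();
        if (caCert != null) {
            // resp.setCaCertificate(caCert) in the real code
        }
        return impl.lookups.get(); // number of lookups performed
    }

    public static void main(String[] args) {
        System.out.println(buildResponse(new ImplStub())); // prints 1
    }
}
```

The original snippet performs the lookup three times on the same code path; the local variable brings that down to one.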
[jira] [Work logged] (HDDS-2462) Add jq dependency in Contribution guideline
[ https://issues.apache.org/jira/browse/HDDS-2462?focusedWorklogId=342146&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-342146 ] ASF GitHub Bot logged work on HDDS-2462: Author: ASF GitHub Bot Created on: 12/Nov/19 20:57 Start Date: 12/Nov/19 20:57 Worklog Time Spent: 10m Work Description: anuengineer commented on pull request #145: HDDS-2462. Add jq dependency in Contribution guideline URL: https://github.com/apache/hadoop-ozone/pull/145 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 342146) Time Spent: 20m (was: 10m) > Add jq dependency in Contribution guideline > --- > > Key: HDDS-2462 > URL: https://issues.apache.org/jira/browse/HDDS-2462 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Istvan Fajth >Assignee: Istvan Fajth >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Docker based tests are using JQ to parse JMX pages of different processes, > but the documentation does not mention it as a dependency. > Add it to CONTRIBUTION.MD in the "Additional requirements to execute > different type of tests" section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDDS-2462) Add jq dependency in Contribution guideline
[ https://issues.apache.org/jira/browse/HDDS-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anu Engineer resolved HDDS-2462. Fix Version/s: 0.5.0 Resolution: Fixed Committed to Master branch. > Add jq dependency in Contribution guideline > --- > > Key: HDDS-2462 > URL: https://issues.apache.org/jira/browse/HDDS-2462 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Istvan Fajth >Assignee: Istvan Fajth >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Docker based tests are using JQ to parse JMX pages of different processes, > but the documentation does not mention it as a dependency. > Add it to CONTRIBUTION.MD in the "Additional requirements to execute > different type of tests" section. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2463) Reduce unnecessary getServiceInfo calls
[ https://issues.apache.org/jira/browse/HDDS-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoyu Yao updated HDDS-2463: - Summary: Reduce unnecessary getServiceInfo calls (was: Remove unnecessary getServiceInfo calls) > Reduce unnecessary getServiceInfo calls > --- > > Key: HDDS-2463 > URL: https://issues.apache.org/jira/browse/HDDS-2463 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Xiaoyu Yao >Assignee: Xiaoyu Yao >Priority: Major > > OzoneManagerProtocolClientSideTranslatorPB.java Line 766-772 has multiple > impl.getServiceInfo() which can be reduced by adding a local variable. > {code:java} > > resp.addAllServiceInfo(impl.getServiceInfo().getServiceInfoList().stream() > .map(ServiceInfo::getProtobuf) > .collect(Collectors.toList())); > if (impl.getServiceInfo().getCaCertificate() != null) { > resp.setCaCertificate(impl.getServiceInfo().getCaCertificate()); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2463) Reduce unnecessary getServiceInfo calls
[ https://issues.apache.org/jira/browse/HDDS-2463?focusedWorklogId=342180&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-342180 ] ASF GitHub Bot logged work on HDDS-2463: Author: ASF GitHub Bot Created on: 12/Nov/19 21:32 Start Date: 12/Nov/19 21:32 Worklog Time Spent: 10m Work Description: xiaoyuyao commented on pull request #146: HDDS-2463. Reduce unnecessary getServiceInfo calls. Contributed by Xi… URL: https://github.com/apache/hadoop-ozone/pull/146 …aoyu Yao. ## What changes were proposed in this pull request? Reduce unnecessary getServiceInfo calls. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2463 ## How was this patch tested? Run Ozone RPC related unit tests and acceptance tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 342180) Remaining Estimate: 0h Time Spent: 10m > Reduce unnecessary getServiceInfo calls > --- > > Key: HDDS-2463 > URL: https://issues.apache.org/jira/browse/HDDS-2463 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Xiaoyu Yao >Assignee: Xiaoyu Yao >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > OzoneManagerProtocolClientSideTranslatorPB.java Line 766-772 has multiple > impl.getServiceInfo() calls, which can be reduced by adding a local variable. 
> {code:java} > > resp.addAllServiceInfo(impl.getServiceInfo().getServiceInfoList().stream() > .map(ServiceInfo::getProtobuf) > .collect(Collectors.toList())); > if (impl.getServiceInfo().getCaCertificate() != null) { > resp.setCaCertificate(impl.getServiceInfo().getCaCertificate()); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2463) Reduce unnecessary getServiceInfo calls
[ https://issues.apache.org/jira/browse/HDDS-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2463: - Labels: pull-request-available (was: ) > Reduce unnecessary getServiceInfo calls > --- > > Key: HDDS-2463 > URL: https://issues.apache.org/jira/browse/HDDS-2463 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Xiaoyu Yao >Assignee: Xiaoyu Yao >Priority: Major > Labels: pull-request-available > > OzoneManagerProtocolClientSideTranslatorPB.java Line 766-772 has multiple > impl.getServiceInfo() which can be reduced by adding a local variable. > {code:java} > > resp.addAllServiceInfo(impl.getServiceInfo().getServiceInfoList().stream() > .map(ServiceInfo::getProtobuf) > .collect(Collectors.toList())); > if (impl.getServiceInfo().getCaCertificate() != null) { > resp.setCaCertificate(impl.getServiceInfo().getCaCertificate()); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2464) Avoid unnecessary allocations for FileChannel.open call
Attila Doroszlai created HDDS-2464: -- Summary: Avoid unnecessary allocations for FileChannel.open call Key: HDDS-2464 URL: https://issues.apache.org/jira/browse/HDDS-2464 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Ozone Datanode Reporter: Attila Doroszlai Assignee: Attila Doroszlai {{ChunkUtils}} calls {{FileChannel#open(Path, OpenOption...)}}. Vararg array elements are then added to a new {{HashSet}} to call {{FileChannel#open(Path, Set, FileAttribute...)}}. We can call the latter directly instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
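Both overloads named above are in the JDK: {{FileChannel#open(Path, OpenOption...)}} copies its vararg array into a new {{HashSet}} and then delegates to {{FileChannel#open(Path, Set, FileAttribute...)}}. A sketch of the proposed direction, calling the {{Set}}-based overload directly with a reusable {{EnumSet}} so no per-call set allocation is needed (file names here are illustrative, not Ozone code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.EnumSet;

public class FileChannelOpenDemo {

    // Built once and reused: the varargs overload would rebuild an equivalent
    // HashSet from the option array on every open() call.
    private static final EnumSet<StandardOpenOption> WRITE_OPTIONS =
            EnumSet.of(StandardOpenOption.CREATE, StandardOpenOption.WRITE);

    // Writes a small payload via the Set-based overload and returns the
    // resulting file size.
    public static long writeHello(Path target) throws IOException {
        try (FileChannel ch = FileChannel.open(target, WRITE_OPTIONS)) {
            ch.write(ByteBuffer.wrap("hello".getBytes()));
        }
        return Files.size(target);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("chunk", ".tmp");
        System.out.println(writeHello(tmp)); // prints 5
        Files.delete(tmp);
    }
}
```

On a hot path such as chunk writes, hoisting the option set into a shared {{EnumSet}} avoids both the vararg array and the {{HashSet}} allocation per call.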
[jira] [Updated] (HDDS-2105) Merge OzoneClientFactory#getRpcClient functions
[ https://issues.apache.org/jira/browse/HDDS-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyao Meng updated HDDS-2105: - Description: Ref: https://github.com/apache/hadoop/pull/1360#discussion_r321585214 There will be 5 overloaded OzoneClientFactory#getRpcClient functions (when HDDS-2007 is committed). They contain some redundant logic and unnecessarily increase code paths. Goal: Merge those functions into fewer ones. was: Ref: https://github.com/apache/hadoop/pull/1360#discussion_r321585214 There will be 5 overloaded OzoneClientFactory#getRpcClient functions (when HDDS-2007 is committed). They contains some redundant logic and unnecessarily increases code paths. Goal: Merge those functions into one or two. Work will begin after HDDS-2007 is committed. > Merge OzoneClientFactory#getRpcClient functions > --- > > Key: HDDS-2105 > URL: https://issues.apache.org/jira/browse/HDDS-2105 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > > Ref: https://github.com/apache/hadoop/pull/1360#discussion_r321585214 > There will be 5 overloaded OzoneClientFactory#getRpcClient functions (when > HDDS-2007 is committed). They contain some redundant logic and unnecessarily > increase code paths. > Goal: Merge those functions into fewer ones. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2105) Merge OzoneClientFactory#getRpcClient functions
[ https://issues.apache.org/jira/browse/HDDS-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2105: - Labels: pull-request-available (was: ) > Merge OzoneClientFactory#getRpcClient functions > --- > > Key: HDDS-2105 > URL: https://issues.apache.org/jira/browse/HDDS-2105 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Labels: pull-request-available > > Ref: https://github.com/apache/hadoop/pull/1360#discussion_r321585214 > There will be 5 overloaded OzoneClientFactory#getRpcClient functions (when > HDDS-2007 is committed). They contains some redundant logic and unnecessarily > increases code paths. > Goal: Merge those functions into fewer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2464) Avoid unnecessary allocations for FileChannel.open call
[ https://issues.apache.org/jira/browse/HDDS-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2464: - Labels: pull-request-available (was: ) > Avoid unnecessary allocations for FileChannel.open call > --- > > Key: HDDS-2464 > URL: https://issues.apache.org/jira/browse/HDDS-2464 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: pull-request-available > > {{ChunkUtils}} calls {{FileChannel#open(Path, OpenOption...)}}. Vararg array > elements are then added to a new {{HashSet}} to call {{FileChannel#open(Path, > Set, FileAttribute...)}}. We can call the latter > directly instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2105) Merge OzoneClientFactory#getRpcClient functions
[ https://issues.apache.org/jira/browse/HDDS-2105?focusedWorklogId=342234&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-342234 ] ASF GitHub Bot logged work on HDDS-2105: Author: ASF GitHub Bot Created on: 12/Nov/19 22:14 Start Date: 12/Nov/19 22:14 Worklog Time Spent: 10m Work Description: smengcl commented on pull request #148: HDDS-2105. Merge OzoneClientFactory#getRpcClient functions URL: https://github.com/apache/hadoop-ozone/pull/148 ## What changes were proposed in this pull request? There are in total 6 overloaded `OzoneClientFactory#getRpcClient` functions now. Some of them are not used or just used once. Remove/merge some of them. (Should be fine to simply remove public function without deprecating at this moment since ozone is still in alpha?) ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2105 ## How was this patch tested? Rerun all existing tests, since this is just a straightforward refactoring. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 342234) Remaining Estimate: 0h Time Spent: 10m > Merge OzoneClientFactory#getRpcClient functions > --- > > Key: HDDS-2105 > URL: https://issues.apache.org/jira/browse/HDDS-2105 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Ref: https://github.com/apache/hadoop/pull/1360#discussion_r321585214 > There will be 5 overloaded OzoneClientFactory#getRpcClient functions (when > HDDS-2007 is committed). They contains some redundant logic and unnecessarily > increases code paths. > Goal: Merge those functions into fewer. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2464) Avoid unnecessary allocations for FileChannel.open call
[ https://issues.apache.org/jira/browse/HDDS-2464?focusedWorklogId=342233&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-342233 ] ASF GitHub Bot logged work on HDDS-2464: Author: ASF GitHub Bot Created on: 12/Nov/19 22:14 Start Date: 12/Nov/19 22:14 Worklog Time Spent: 10m Work Description: adoroszlai commented on pull request #147: HDDS-2464. Avoid unnecessary allocations for FileChannel.open call URL: https://github.com/apache/hadoop-ozone/pull/147 ## What changes were proposed in this pull request? `ChunkUtils` calls `FileChannel#open(Path, OpenOption...)`. Vararg array elements are then added to a new `HashSet` to be passed to `FileChannel#open(Path, Set, FileAttribute...)`. We can call the latter directly instead. https://issues.apache.org/jira/browse/HDDS-2464 ## How was this patch tested? Ran `TestChunkUtils`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 342233) Remaining Estimate: 0h Time Spent: 10m > Avoid unnecessary allocations for FileChannel.open call > --- > > Key: HDDS-2464 > URL: https://issues.apache.org/jira/browse/HDDS-2464 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {{ChunkUtils}} calls {{FileChannel#open(Path, OpenOption...)}}. Vararg array > elements are then added to a new {{HashSet}} to call {{FileChannel#open(Path, > Set, FileAttribute...)}}. We can call the latter > directly instead. 
[jira] [Work started] (HDDS-2105) Merge OzoneClientFactory#getRpcClient functions
[ https://issues.apache.org/jira/browse/HDDS-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDDS-2105 started by Siyao Meng. > Merge OzoneClientFactory#getRpcClient functions > --- > > Key: HDDS-2105 > URL: https://issues.apache.org/jira/browse/HDDS-2105 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Ref: https://github.com/apache/hadoop/pull/1360#discussion_r321585214 > There will be 5 overloaded OzoneClientFactory#getRpcClient functions (when > HDDS-2007 is committed). They contain some redundant logic and unnecessarily > increase the number of code paths. > Goal: Merge those functions into fewer.
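The funneling pattern behind this refactor can be sketched as below. This is not the real `OzoneClientFactory` API — the actual methods take a `Configuration` and return an `OzoneClient`; the `String` return type, the port `9862`, and the `"default"` service id here are stand-ins chosen only to keep the sketch self-contained. The point is the shape: every convenience overload delegates to one canonical method, so there is a single code path.

```java
// Hypothetical sketch: collapsing several getRpcClient overloads into one
// canonical method that thin convenience wrappers delegate to.
public final class OzoneClientFactorySketch {

    public static String getRpcClient(String omHost) {
        // Assumed defaults; the real factory reads these from configuration.
        return getRpcClient(omHost, 9862, "default");
    }

    public static String getRpcClient(String omHost, int omPort) {
        return getRpcClient(omHost, omPort, "default");
    }

    // Single code path: every overload funnels into this method.
    public static String getRpcClient(String omHost, int omPort, String serviceId) {
        return serviceId + "://" + omHost + ":" + omPort;
    }

    public static void main(String[] args) {
        System.out.println(getRpcClient("om1"));
    }
}
```

With this structure, fixing a bug or adding validation in the canonical method covers all entry points at once, which is the redundancy the issue aims to remove.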
[jira] [Updated] (HDDS-2464) Avoid unnecessary allocations for FileChannel.open call
[ https://issues.apache.org/jira/browse/HDDS-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Doroszlai updated HDDS-2464: --- Status: Patch Available (was: Open) > Avoid unnecessary allocations for FileChannel.open call > --- > > Key: HDDS-2464 > URL: https://issues.apache.org/jira/browse/HDDS-2464 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {{ChunkUtils}} calls {{FileChannel#open(Path, OpenOption...)}}. Vararg array > elements are then added to a new {{HashSet}} to call {{FileChannel#open(Path, > Set, FileAttribute...)}}. We can call the latter > directly instead.
[jira] [Created] (HDFS-14982) Backport HADOOP-16152 to branch-3.1
Siyao Meng created HDFS-14982: - Summary: Backport HADOOP-16152 to branch-3.1 Key: HDFS-14982 URL: https://issues.apache.org/jira/browse/HDFS-14982 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.1.3 Reporter: Siyao Meng Assignee: Siyao Meng HADOOP-16152. Upgrade Eclipse Jetty version to 9.4.x
[jira] [Commented] (HDFS-14884) Add sanity check that zone key equals feinfo key while setting Xattrs
[ https://issues.apache.org/jira/browse/HDFS-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972859#comment-16972859 ] Wei-Chiu Chuang commented on HDFS-14884: Sorry, I missed it. I'll review it for sure. > Add sanity check that zone key equals feinfo key while setting Xattrs > - > > Key: HDFS-14884 > URL: https://issues.apache.org/jira/browse/HDFS-14884 > Project: Hadoop HDFS > Issue Type: Bug > Components: encryption, hdfs >Affects Versions: 2.11.0 >Reporter: Mukul Kumar Singh >Assignee: Yuval Degani >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.11.0 > > Attachments: HDFS-14884-branch-2.001.patch, HDFS-14884.001.patch, > HDFS-14884.002.patch, HDFS-14884.003.patch, hdfs_distcp.patch > > > Currently, it is possible to set an extended attribute where the zone key is > not the same as the feinfo key. This jira will add a precondition check before setting > it.
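The proposed precondition amounts to a simple equality check between the encryption zone's key name and the key name inside the file's `FileEncryptionInfo` xattr. The sketch below is a hedged illustration only — the method and class names are invented, and the actual patch wires the check into the NameNode's xattr-setting path rather than a standalone helper.

```java
// Hypothetical sketch of the sanity check: before persisting an encryption
// xattr, verify the FileEncryptionInfo key name matches the zone key name.
public class ZoneKeySanityCheck {

    static void checkZoneKeyMatches(String zoneKeyName, String feInfoKeyName) {
        if (!zoneKeyName.equals(feInfoKeyName)) {
            throw new IllegalArgumentException(
                "FileEncryptionInfo key " + feInfoKeyName
                + " does not match encryption zone key " + zoneKeyName);
        }
    }

    public static void main(String[] args) {
        checkZoneKeyMatches("zoneKey1", "zoneKey1"); // matching keys pass silently
        try {
            checkZoneKeyMatches("zoneKey1", "otherKey");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected mismatched key");
        }
    }
}
```

Rejecting the mismatch up front prevents a file from ending up undecryptable with the zone's key, which is the failure mode the issue guards against.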
[jira] [Commented] (HDFS-14528) Failover from Active to Standby Failed
[ https://issues.apache.org/jira/browse/HDFS-14528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972872#comment-16972872 ] Hadoop QA commented on HDFS-14528: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 25s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 15s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 22m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 19m 0s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 42s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 37s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 2m 36s{color} | {color:orange} root: The patch generated 3 new + 36 unchanged - 0 fixed = 39 total (was 36) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 26s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 47s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 40s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 56s{color} | {color:green} hadoop-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 98m 41s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 48s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}230m 5s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.tools.TestObserverManualFailover | | | hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer | | | hadoop.hdfs.server.balancer.TestBalancer | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14528 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985649/HDFS-14528.006.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 5c78eec29cb6 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality
[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972874#comment-16972874 ] Bharat Viswanadham commented on HDDS-2356: -- Hi [~timmylicheng] Thanks for sharing the logs. I see an abort multipart upload request for the key plc_1570863541668_9278 once complete multipart upload failed. 2019-11-08 20:08:24,830 | ERROR | OMAudit | user=root | ip=9.134.50.210 | op=COMPLETE_MULTIPART_UPLOAD {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, key=plc_1570863541668_9278, dataSize=0, replicationType=RATIS, replicationFactor=ONE, keyLocationIn fo=[], multipartList=[partNumber: 1 5626 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102209374356085" 5627 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102209374487158" . . 5911 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102211984655258" 5912 ]} | ret=FAILURE | INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07adbucket: ozone-testkey: plc_1570863541668_9278 2019-11-08 20:08:24,963 | INFO | OMAudit | user=root | ip=9.134.50.210 | op=ABORT_MULTIPART_UPLOAD {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, key=plc_1570863541668_9278, dataSize=0, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo= []} | ret=SUCCESS | And after that still, allocateBlock is continuing for the key because the entry from openKeyTable is not removed by abortMultipartUpload request.(Abort removed only entry which has been created during initiateMPU request, so that is the reason after some time you see the NO_SUCH_MULTIPART_UPLOAD error during commitMultipartUploadKey, as we removed entry from MultipartInfo table. 
(But the strange thing I have observed is that the clientID does not match any of the names in the part list, even though the last component of each partName should be the clientID.) Also, from the OM audit log I see only partNumber 1 followed by a list of part names; it should show a partName/partNumber pair for each part, so some of the log may be truncated. # If you can confirm what parts OM has for this key, you can get this from listParts (but this must be done before the abort request). # Check in the OM audit log what part list we received for this key; I am not sure whether it is truncated in the uploaded log. On my cluster the audit logs look like below. {code:java} 2019-11-12 14:57:18,580 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=INITIATE_MULTIPART_UPLOAD {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=0, replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[]} | ret=SUCCESS | 2019-11-12 14:57:53,967 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=ALLOCATE_KEY {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[]} | ret=SUCCESS | 2019-11-12 14:57:53,974 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=ALLOCATE_BLOCK {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[], clientID=103127415125868581} | ret=SUCCESS | 2019-11-12 14:57:54,154 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=COMMIT_MULTIPART_UPLOAD_PARTKEY {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[blockID { containerBlockID { containerID: 6 localID: 103127415126327331 } blockCommitSequenceId: 18 } offset: 0 length: 5242880 createVersion: 0 pipeline { leaderID: "" members { uuid: "a5fba53c-3aa9-48a7-8272-34c606f93bc6" ipAddress: "10.65.49.251" hostName: "bh-ozone-3.vpc.cloudera.com" ports { name: 
"RATIS" value: 9858 } ports { name: "STANDALONE" value: 9859 } networkName: "a5fba53c-3aa9-48a7-8272-34c606f93bc6" networkLocation: "/default-rack" } members { uuid: "5e2625aa-637e-4e5a-a0a1-6683bd108b0d" ipAddress: "10.65.51.23" hostName: "bh-ozone-4.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDALONE" value: 9859 } networkName: "5e2625aa-637e-4e5a-a0a1-6683bd108b0d" networkLocation: "/default-rack" } members { uuid: "cf8aace1-92b8-496e-aed9-f2771c83a56b" ipAddress: "10.65.53.160" hostName: "bh-ozone-2.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDALONE" value: 9859 } networkName: "cf8aace1-92b8-496e-aed9-f2771c83a56b" networkLocation: "/default-rack" } state: PIPELINE_OPEN type: RATIS factor: T
[jira] [Resolved] (HDFS-14792) [SBN read] StandbyNode does not come out of safemode while adding new blocks.
[ https://issues.apache.org/jira/browse/HDFS-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko resolved HDFS-14792. Fix Version/s: 2.10.1 Resolution: Fixed This turned out to be related to the same race condition between edits {{OP_ADD_BLOCK}} and IBRs as HDFS-14941. We do not see any delays in leaving safemode on StandbyNode after the HDFS-14941 fix. Closing this as fixed. > [SBN read] StandbyNode does not come out of safemode while adding new blocks. > > > Key: HDFS-14792 > URL: https://issues.apache.org/jira/browse/HDFS-14792 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Priority: Major > Fix For: 2.10.1 > > > During startup, StandbyNode reports that it needs an additional X blocks to reach > the threshold 1.., where X changes up and down. > This is because with fast tailing, SBN adds new blocks from edits while DNs > have not reported replicas yet. Being in SafeMode, SBN counts new blocks > towards the threshold and can stay in SafeMode for a long time. > By design, the purpose of startup SafeMode is to disallow modifications of > the namespace and blocks map until all DN replicas are reported.
[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972874#comment-16972874 ] Bharat Viswanadham edited comment on HDDS-2356 at 11/12/19 11:17 PM: - Hi [~timmylicheng] Thanks for sharing the logs. I see an abort multipart upload request for the key plc_1570863541668_9278 once complete multipart upload failed. {code:java} 2019-11-08 20:08:24,830 | ERROR | OMAudit | user=root | ip=9.134.50.210 | op=COMPLETE_MULTIPART_UPLOAD {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, key=plc_1570863541668_9278, dataSize=0, replicationType=RATIS, replicationFactor=ONE, keyLocationIn fo=[], multipartList=[partNumber: 1 5626 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102209374356085" 5627 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102209374487158" . . 5911 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102211984655258" 5912 ]} | ret=FAILURE | INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07adbucket: ozone-testkey: plc_1570863541668_9278 2019-11-08 20:08:24,963 | INFO | OMAudit | user=root | ip=9.134.50.210 | op=ABORT_MULTIPART_UPLOAD {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, key=plc_1570863541668_9278, dataSize=0, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo= []} {code} And after that still, allocateBlock is continuing for the key because the entry from openKeyTable is not removed by abortMultipartUpload request.(Abort removed only entry which has been created during initiateMPU request, so that is the reason after some time you see the NO_SUCH_MULTIPART_UPLOAD error during commitMultipartUploadKey, as we removed entry from MultipartInfo table. 
(But the strange thing I have observed is that the clientID does not match any of the names in the part list, even though the last component of each partName should be the clientID.) Also, from the OM audit log I see only partNumber 1 followed by a list of part names; it should show a partName/partNumber pair for each part, so some of the log may be truncated. # If you can confirm what parts OM has for this key, you can get this from listParts (but this must be done before the abort request). # Check in the OM audit log what part list we received for this key; I am not sure whether it is truncated in the uploaded log. On my cluster the audit logs look like below: for completeMultipartUpload I can see both partNumber and partName, whereas the uploaded log does not show this. {code:java} 2019-11-12 14:57:18,580 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=INITIATE_MULTIPART_UPLOAD {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=0, replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[]} | ret=SUCCESS | 2019-11-12 14:57:53,967 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=ALLOCATE_KEY {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[]} | ret=SUCCESS | 2019-11-12 14:57:53,974 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=ALLOCATE_BLOCK {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[], clientID=103127415125868581} | ret=SUCCESS | 2019-11-12 14:57:54,154 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=COMMIT_MULTIPART_UPLOAD_PARTKEY {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[blockID { containerBlockID { containerID: 6 localID: 103127415126327331 } blockCommitSequenceId: 18 } offset: 0 length: 5242880 createVersion: 0 pipeline { leaderID: "" members { uuid: 
"a5fba53c-3aa9-48a7-8272-34c606f93bc6" ipAddress: "10.65.49.251" hostName: "bh-ozone-3.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDALONE" value: 9859 } networkName: "a5fba53c-3aa9-48a7-8272-34c606f93bc6" networkLocation: "/default-rack" } members { uuid: "5e2625aa-637e-4e5a-a0a1-6683bd108b0d" ipAddress: "10.65.51.23" hostName: "bh-ozone-4.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDALONE" value: 9859 } networkName: "5e2625aa-637e-4e5a-a0a1-6683bd108b0d" networkLocation: "/default-rack" } members { uuid: "cf8aace1-92b8-496e-aed9-f2771c83a56b" ipAddress: "10.65.53.160" hostName: "bh-ozone-2.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDA
[jira] [Comment Edited] (HDFS-14283) DFSInputStream to prefer cached replica
[ https://issues.apache.org/jira/browse/HDFS-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972895#comment-16972895 ] Siyao Meng edited comment on HDFS-14283 at 11/12/19 11:20 PM: -- [~leosun08] I discussed with [~weichiu]. I'm fine with ditching the sorting logic on the server side so that we don't need to make any server side changes in this patch. One reason is that in most cases there will only be one cached replica for a block. We will simply allow the client to prefer the cached replica with a configuration option then. was (Author: smeng): [~leosun08] I discussed with [~weichiu]. I'm fine with ditching the sorting logic on the server side so that we don't need to make any server side changed in this patch. One reason is that in most cases there will only be one cached replica for a block. We will simply allow the client to prefer the cached replica with a configuration option then. > DFSInputStream to prefer cached replica > --- > > Key: HDFS-14283 > URL: https://issues.apache.org/jira/browse/HDFS-14283 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.6.0 > Environment: HDFS Caching >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14283.001.patch, HDFS-14283.002.patch, > HDFS-14283.003.patch, HDFS-14283.004.patch > > > HDFS Caching offers performance benefits. However, currently NameNode does > not treat cached replica with higher priority, so HDFS caching is only useful > when cache replication = 3, that is to say, all replicas are cached in > memory, so that a client doesn't randomly pick an uncached replica. > HDFS-6846 proposed to let NameNode give higher priority to cached replica. > Changing a logic in NameNode is always tricky so that didn't get much > traction. Here I propose a different approach: let client (DFSInputStream) > prefer cached replica. 
> A {{LocatedBlock}} object already contains cached replica location so a > client has the needed information. I think we can change > {{DFSInputStream#getBestNodeDNAddrPair()}} for this purpose.
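The client-side preference discussed here can be sketched as below. All names are illustrative — the real logic would live inside `DFSInputStream#getBestNodeDNAddrPair()` operating on `DatanodeInfo[]` arrays from `LocatedBlock`, not on strings — but the selection rule is the one proposed: when the (assumed) config flag is on, scan the block's locations and pick the first one that also appears among the cached locations, otherwise fall back to the existing order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hedged sketch of the HDFS-14283 idea: when choosing a datanode for a
// block, prefer one that holds the replica cached in memory.
public class PreferCachedReplica {

    static String chooseDatanode(List<String> locations,
                                 Set<String> cachedLocations,
                                 boolean preferCached) {
        if (preferCached) {
            for (String dn : locations) {
                if (cachedLocations.contains(dn)) {
                    return dn; // first location holding a cached replica wins
                }
            }
        }
        return locations.get(0); // fall back to the normally sorted order
    }

    public static void main(String[] args) {
        List<String> locs = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> cached = Set.of("dn2");
        System.out.println(chooseDatanode(locs, cached, true));
    }
}
```

Because `LocatedBlock` already carries the cached-replica locations, this needs no NameNode change — which is the appeal of the client-side approach over HDFS-6846.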
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972874#comment-16972874 ] Bharat Viswanadham edited comment on HDDS-2356 at 11/12/19 11:22 PM: - Hi [~timmylicheng] Thanks for sharing the logs. I see an abort multipart upload request for the key plc_1570863541668_9278 after the complete multipart upload failed.
{code:java}
2019-11-08 20:08:24,830 | ERROR | OMAudit | user=root | ip=9.134.50.210 | op=COMPLETE_MULTIPART_UPLOAD {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, key=plc_1570863541668_9278, dataSize=0, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[], multipartList=[partNumber: 1 5626 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102209374356085" 5627 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102209374487158" . . 5911 partName: "/s325d55ad283aa400af464c76d713c07ad/ozone-test/plc_1570863541668_9278103102211984655258" 5912 ]} | ret=FAILURE | INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07ad bucket: ozone-test key: plc_1570863541668_9278
2019-11-08 20:08:24,963 | INFO | OMAudit | user=root | ip=9.134.50.210 | op=ABORT_MULTIPART_UPLOAD {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, key=plc_1570863541668_9278, dataSize=0, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[]}
{code}
And after that, allocateBlock still continues for the key, because the key's entry in the openKeyTable is not removed by the abortMultipartUpload request. (Abort removes only the entry created during the initiateMPU request; that is why, after some time, you see the NO_SUCH_MULTIPART_UPLOAD error during commitMultipartUploadKey, as the entry was removed from the MultipartInfo table.)
(But the strange thing I have observed is that the clientID does not match any of the names in the part list; the last segment of a partName is the clientID.) And from the OM audit log, I see partNumber 1 and then a list of multipart names; I am not sure if some of the log is truncated here, as it should show both part name and partNumber.
# Please confirm what parts OM has for this key; you can get this from listParts (but this should be done before the abort request).
# Check in the OM audit log what part list we get for this key; I am not sure whether it is truncated in the uploaded log.
On my cluster the audit logs look like below, where for completeMultipartUpload I can see both partNumber and partName (whereas I don't see this in the uploaded log):
{code:java}
2019-11-12 14:57:18,580 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=INITIATE_MULTIPART_UPLOAD {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=0, replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[]} | ret=SUCCESS |
2019-11-12 14:57:53,967 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=ALLOCATE_KEY {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[]} | ret=SUCCESS |
2019-11-12 14:57:53,974 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=ALLOCATE_BLOCK {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[], clientID=103127415125868581} | ret=SUCCESS |
2019-11-12 14:57:54,154 | INFO | OMAudit | user=root | ip=10.65.53.160 | op=COMMIT_MULTIPART_UPLOAD_PARTKEY {volume=s3dfb57b2e5f36c1f893dbc12dd66897d4, bucket=b1234, key=key123, dataSize=5242880, replicationType=RATIS, replicationFactor=ONE, keyLocationInfo=[blockID { containerBlockID { containerID: 6 localID: 103127415126327331 } blockCommitSequenceId: 18 } offset: 0 length: 5242880 createVersion: 0 pipeline { leaderID: "" members { uuid:
"a5fba53c-3aa9-48a7-8272-34c606f93bc6" ipAddress: "10.65.49.251" hostName: "bh-ozone-3.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDALONE" value: 9859 } networkName: "a5fba53c-3aa9-48a7-8272-34c606f93bc6" networkLocation: "/default-rack" } members { uuid: "5e2625aa-637e-4e5a-a0a1-6683bd108b0d" ipAddress: "10.65.51.23" hostName: "bh-ozone-4.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDALONE" value: 9859 } networkName: "5e2625aa-637e-4e5a-a0a1-6683bd108b0d" networkLocation: "/default-rack" } members { uuid: "cf8aace1-92b8-496e-aed9-f2771c83a56b" ipAddress: "10.65.53.160" hostName: "bh-ozone-2.vpc.cloudera.com" ports { name: "RATIS" value: 9858 } ports { name: "STANDA
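For readers following the INVALID_PART discussion above: conceptually, OM records a part name per part number as parts are committed, and completeMultipartUpload fails if the client-supplied list does not match what OM recorded. The sketch below is a hypothetical, self-contained illustration of that check; the class and method names are invented and this is not the actual OM code.

```java
import java.util.Map;

// Hypothetical sketch of the complete-multipart validation described above:
// every (partNumber, partName) pair the client sends must match the entry OM
// recorded when the part was committed, otherwise the upload fails.
public class PartListValidator {

    // Returns null on success, or an error code such as "INVALID_PART".
    public static String validate(Map<Integer, String> clientParts,
                                  Map<Integer, String> omRecordedParts) {
        for (Map.Entry<Integer, String> e : clientParts.entrySet()) {
            String recorded = omRecordedParts.get(e.getKey());
            if (recorded == null || !recorded.equals(e.getValue())) {
                return "INVALID_PART";
            }
        }
        return null;
    }
}
```

This is why listParts output (taken before the abort, while the MultipartInfo entry still exists) is the useful thing to compare against the part list in the audit log.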
[jira] [Created] (HDDS-2465) S3 Multipart upload failing
Bharat Viswanadham created HDDS-2465: Summary: S3 Multipart upload failing Key: HDDS-2465 URL: https://issues.apache.org/jira/browse/HDDS-2465 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Bharat Viswanadham When I run the attached Java program, I get the below error during completeMultipartUpload. {code:java} ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: c7b87393-955b-4c93-85f6-b02945e293ca; S3 Extended Request ID: 7tnVbqgc4bgb), S3 Extended Request ID: 7tnVbqgc4bgb at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726) at
com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867) at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3464) at org.apache.hadoop.ozone.freon.MPU.main(MPU.java:96){code} When I debug, it appears the request has not been received by the S3Gateway, and I don't see any trace of it in the audit log. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2465) S3 Multipart upload failing
[ https://issues.apache.org/jira/browse/HDDS-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharat Viswanadham updated HDDS-2465: - Attachment: MPU.java > S3 Multipart upload failing > --- > > Key: HDDS-2465 > URL: https://issues.apache.org/jira/browse/HDDS-2465 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Bharat Viswanadham >Priority: Major > Attachments: MPU.java > > > When I run attached java program, facing below error, during > completeMultipartUpload. > {code:java} > ERROR StatusLogger No Log4j 2 configuration file found. Using default > configuration (logging only errors to the console), or user programmatically > provided configurations. Set system property 'log4j2.debug' to show Log4j 2 > internal initialization logging. See > https://logging.apache.org/log4j/2.x/manual/configuration.html for > instructions on how to configure Log4j 2ERROR StatusLogger No Log4j 2 > configuration file found. Using default configuration (logging only errors to > the console), or user programmatically provided configurations. Set system > property 'log4j2.debug' to show Log4j 2 internal initialization logging. 
See > https://logging.apache.org/log4j/2.x/manual/configuration.html for > instructions on how to configure Log4j 2Exception in thread "main" > com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: > Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: > c7b87393-955b-4c93-85f6-b02945e293ca; S3 Extended Request ID: 7tnVbqgc4bgb), > S3 Extended Request ID: 7tnVbqgc4bgb at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668) > at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) at > com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921) at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867) at > com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3464) > at org.apache.hadoop.ozone.freon.MPU.main(MPU.java:96){code} > When I debug it is not the request is not been received by S3Gateway, and I > don't see any trace of this in audit log. 
[jira] [Commented] (HDDS-2465) S3 Multipart upload failing
[ https://issues.apache.org/jira/browse/HDDS-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972951#comment-16972951 ] Bharat Viswanadham commented on HDDS-2465: -- cc [~elek] > S3 Multipart upload failing > --- > > Key: HDDS-2465 > URL: https://issues.apache.org/jira/browse/HDDS-2465 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Bharat Viswanadham >Priority: Major > Attachments: MPU.java > > > When I run attached java program, facing below error, during > completeMultipartUpload. > {code:java} > ERROR StatusLogger No Log4j 2 configuration file found. Using default > configuration (logging only errors to the console), or user programmatically > provided configurations. Set system property 'log4j2.debug' to show Log4j 2 > internal initialization logging. See > https://logging.apache.org/log4j/2.x/manual/configuration.html for > instructions on how to configure Log4j 2ERROR StatusLogger No Log4j 2 > configuration file found. Using default configuration (logging only errors to > the console), or user programmatically provided configurations. Set system > property 'log4j2.debug' to show Log4j 2 internal initialization logging. 
See > https://logging.apache.org/log4j/2.x/manual/configuration.html for > instructions on how to configure Log4j 2Exception in thread "main" > com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: > Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: > c7b87393-955b-4c93-85f6-b02945e293ca; S3 Extended Request ID: 7tnVbqgc4bgb), > S3 Extended Request ID: 7tnVbqgc4bgb at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668) > at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) at > com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921) at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867) at > com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3464) > at org.apache.hadoop.ozone.freon.MPU.main(MPU.java:96){code} > When I debug it is not the request is not been received by S3Gateway, and I > don't see any trace of this in audit log. 
[jira] [Updated] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14648: --- Attachment: HDFS-14648.010.patch
> DeadNodeDetector basic model
> Key: HDFS-14648
> URL: https://issues.apache.org/jira/browse/HDFS-14648
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Lisheng Sun
> Assignee: Lisheng Sun
> Priority: Major
> Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, HDFS-14648.009.patch, HDFS-14648.010.patch
>
> This Jira constructs the DeadNodeDetector state machine model. The functions it implements are as follows:
> # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode of the block is found to be inaccessible, the DataNode is put into DeadNodeDetector#deadnode (HDFS-14649 will optimize this part). When a DataNode is not accessible, it is likely that the replica has been removed from it; therefore, this needs to be confirmed by re-probing and requires higher-priority processing.
> # DeadNodeDetector periodically probes the nodes in DeadNodeDetector#deadnode; if the access is successful, the node is removed from DeadNodeDetector#deadnode. Continuous detection of dead nodes is necessary: a DataNode may rejoin the cluster after a service restart or machine repair, and it would be permanently excluded if there were no such probe mechanism.
> # DeadNodeDetector#dfsInputStreamNodes records which DataNodes each DFSInputstream is using. When the DFSInputstream is closed, its entries are removed from DeadNodeDetector#dfsInputStreamNodes.
> # Every time the global deadnode set is fetched, DeadNodeDetector#deadnode is updated: the new DeadNodeDetector#deadnode equals the intersection of the old DeadNodeDetector#deadnode and the DataNodes referenced by DeadNodeDetector#dfsInputStreamNodes.
> # DeadNodeDetector has a switch that is turned off by default. When it is off, each DFSInputstream still uses its own local deadnode set.
> # This feature has been used in the XIAOMI production environment for a long time. It reduced HBase read stalls caused by hanging nodes.
> # Just turn on the DeadNodeDetector switch and you can use it directly; there are no other restrictions. If you don't want to use DeadNodeDetector, just turn it off.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}
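The state machine described in the issue above can be illustrated with a stripped-down, self-contained model. The names here are hypothetical and the real implementation lives in the HDFS-14648 patches; this sketch only shows the three moving parts: streams register the nodes they failed on, a periodic probe rescues recovered nodes, and the global dead set shrinks to the intersection with nodes still referenced by open streams.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of a shared dead-node detector (hypothetical sketch, not
// the HDFS-14648 code). Nodes and streams are plain strings here.
public class DeadNodeModel {
    private final Set<String> deadNodes = new HashSet<>();
    private final Map<String, Set<String>> streamNodes = new HashMap<>();

    // A stream reports a DataNode it failed to read from.
    public synchronized void addDeadNode(String stream, String node) {
        deadNodes.add(node);
        streamNodes.computeIfAbsent(stream, s -> new HashSet<>()).add(node);
    }

    // Periodic probe: a node that answers again leaves the dead set,
    // so a restarted or repaired DataNode is not excluded forever.
    public synchronized void onProbeSuccess(String node) {
        deadNodes.remove(node);
    }

    // When a stream closes, drop its registrations and shrink the global
    // dead set to nodes still referenced by some open stream.
    public synchronized void closeStream(String stream) {
        streamNodes.remove(stream);
        Set<String> referenced = new HashSet<>();
        for (Set<String> nodes : streamNodes.values()) {
            referenced.addAll(nodes);
        }
        deadNodes.retainAll(referenced);
    }

    public synchronized boolean isDead(String node) {
        return deadNodes.contains(node);
    }
}
```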
[jira] [Updated] (HDFS-14612) SlowDiskReport won't update when SlowDisks is always empty in heartbeat
[ https://issues.apache.org/jira/browse/HDFS-14612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibin Huang updated HDFS-14612: Attachment: HDFS-14612-005.patch
> SlowDiskReport won't update when SlowDisks is always empty in heartbeat
> Key: HDFS-14612
> URL: https://issues.apache.org/jira/browse/HDFS-14612
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haibin Huang
> Assignee: Haibin Huang
> Priority: Major
> Attachments: HDFS-14612-001.patch, HDFS-14612-002.patch, HDFS-14612-003.patch, HDFS-14612-004.patch, HDFS-14612-005.patch, HDFS-14612.patch
>
> I found that the SlowDiskReport won't update when slowDisks is always empty in org.apache.hadoop.hdfs.server.blockmanagement.*handleHeartbeat*; this may lead to an outdated SlowDiskReport staying in the jmx of the namenode until the next time slowDisks isn't empty. So I think the method *checkAndUpdateReportIfNecessary()* should be called first when we want to get the jmx information about the SlowDiskReport; this keeps the SlowDiskReport in jmx always valid.
>
> There are also some incorrect object references in org.apache.hadoop.hdfs.server.datanode.fsdataset.*DataNodeVolumeMetrics*:
> {code:java}
> // Based on writeIoRate
> public long getWriteIoSampleCount() {
>   return syncIoRate.lastStat().numSamples();
> }
> public double getWriteIoMean() {
>   return syncIoRate.lastStat().mean();
> }
> public double getWriteIoStdDev() {
>   return syncIoRate.lastStat().stddev();
> }
> {code}
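The core of the proposed fix, refreshing the report before serving it from JMX rather than relying on a non-empty heartbeat to refresh it, can be sketched as follows. This is a hypothetical simplification with invented names, not the actual NameNode code.

```java
// Hypothetical sketch of the proposed behavior: a staleness check runs on the
// JMX read path, so an outdated slow-disk report is dropped even when no
// subsequent heartbeat carried slow-disk data.
public class SlowDiskReportHolder {
    private final long reportValidityMs;
    private String reportJson = "{}";
    private long lastUpdateMs;

    public SlowDiskReportHolder(long reportValidityMs) {
        this.reportValidityMs = reportValidityMs;
    }

    // Called from heartbeat handling when slow disks are reported.
    public synchronized void update(String json, long nowMs) {
        this.reportJson = json;
        this.lastUpdateMs = nowMs;
    }

    // JMX getter: check and invalidate an outdated report first,
    // instead of only updating when slowDisks is non-empty.
    public synchronized String getReportForJmx(long nowMs) {
        if (nowMs - lastUpdateMs > reportValidityMs) {
            reportJson = "{}";   // outdated entries are dropped
        }
        return reportJson;
    }
}
```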
[jira] [Commented] (HDFS-14969) Fix HDFS client unnecessary failover log printing
[ https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972962#comment-16972962 ] Xudong Cao commented on HDFS-14969: --- cc [~xkrogen] [~vagarychen] [~shv] [~weichiu] I feel it's not good to remove the entire log. The more appropriate way is to update the logic to be aware of how many NNs are configured. We may need to add a new method to the FailoverProxyProvider interface such as getProxiesCount(), and then implement it in all subclasses. What do you think? However, after HDFS-14963 is merged in the future, I feel this problem will be greatly alleviated.
> Fix HDFS client unnecessary failover log printing
> Key: HDFS-14969
> URL: https://issues.apache.org/jira/browse/HDFS-14969
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Affects Versions: 3.1.3
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Minor
>
> In a multi-NameNode scenario, suppose there are 3 NNs and the 3rd is the ANN. When a client starts an rpc with the 1st NN, it is silent when failing over from the 1st NN to the 2nd NN, but when failing over from the 2nd NN to the 3rd NN it prints some unnecessary logs; in some scenarios these logs can be very numerous:
> {code:java}
> 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby.
Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
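The proposal above (a new getProxiesCount() on FailoverProxyProvider plus a comparison against the failover count in RetryInvocationHandler) boils down to a small predicate. The sketch below is a self-contained illustration of that logic; getProxiesCount() does not exist yet, and the class here is invented for illustration.

```java
// Hypothetical sketch of the proposed logic: suppress the failover log while
// there are still untried NameNodes, and only print once the client has
// cycled through all of them without finding the active NN.
public class FailoverLogPolicy {
    private final int proxiesCount;   // would come from the proposed getProxiesCount()

    public FailoverLogPolicy(int proxiesCount) {
        this.proxiesCount = proxiesCount;
    }

    // failoverCount: how many times this invocation has already failed over.
    public boolean shouldLogFailover(int failoverCount) {
        // With N NNs, the first N-1 failovers are expected while searching
        // for the active NN; only log beyond that.
        return failoverCount >= proxiesCount - 1;
    }
}
```

With 3 configured NNs this stays silent through the two expected failovers and logs only if the client keeps failing over after trying every NN.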
[jira] [Comment Edited] (HDFS-14969) Fix HDFS client unnecessary failover log printing
[ https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972962#comment-16972962 ] Xudong Cao edited comment on HDFS-14969 at 11/13/19 2:34 AM: - cc [~xkrogen] [~vagarychen] [~shv] [~weichiu] I feel it's not good to remove the entire log. The more appropriate way is to update the logic to be aware of how many NNs are configured. We may need to add a new method to the FailoverProxyProvider interface such as getProxiesCount() and implement it in all subclasses. Then We can compare the current failover count and the total number of NNs in RetryInvocationHandler to determine whether to print the failover log. What do you think? However, after the HDFS-14963 is merged in the future, I feel that this problem will be greatly alleviated. was (Author: xudongcao): cc [~xkrogen] [~vagarychen] [~shv] [~weichiu] I feel it's not good to remove the entire log. The more appropriate way is to update the logic to be aware of how many NNs are configured. We may need to add a new method to the FailoverProxyProvider interface such as getProxiesCount() , and then implement it in all subclasses. What do you think? However, after the HDFS-14963 is merged in the future, I feel that this problem will be greatly alleviated. 
> Fix HDFS client unnecessary failover log printing > - > > Key: HDFS-14969 > URL: https://issues.apache.org/jira/browse/HDFS-14969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Minor > > In multi-NameNodes scenario, suppose there are 3 NNs and the 3rd is ANN, and > then a client starts rpc with the 1st NN, it will be silent when failover > from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd > NN, it prints some unnecessary logs, in some scenarios, these logs will be > very numerous: > {code:java} > 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14283) DFSInputStream to prefer cached replica
[ https://issues.apache.org/jira/browse/HDFS-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14283: --- Attachment: HDFS-14283.005.patch > DFSInputStream to prefer cached replica > --- > > Key: HDFS-14283 > URL: https://issues.apache.org/jira/browse/HDFS-14283 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.6.0 > Environment: HDFS Caching >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14283.001.patch, HDFS-14283.002.patch, > HDFS-14283.003.patch, HDFS-14283.004.patch, HDFS-14283.005.patch > > > HDFS Caching offers performance benefits. However, currently NameNode does > not treat cached replica with higher priority, so HDFS caching is only useful > when cache replication = 3, that is to say, all replicas are cached in > memory, so that a client doesn't randomly pick an uncached replica. > HDFS-6846 proposed to let NameNode give higher priority to cached replica. > Changing a logic in NameNode is always tricky so that didn't get much > traction. Here I propose a different approach: let client (DFSInputStream) > prefer cached replica. > A {{LocatedBlock}} object already contains cached replica location so a > client has the needed information. I think we can change > {{DFSInputStream#getBestNodeDNAddrPair()}} for this purpose. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
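The proposed client-side preference can be sketched independently of HDFS: given a block's locations and the subset reported as cached, pick a cached, non-excluded node first. This is a hypothetical simplification of what a change to DFSInputStream#getBestNodeDNAddrPair() might do, with plain strings standing in for DatanodeInfo.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: prefer a replica the block reports as cached,
// falling back to the first non-dead location otherwise.
public class ReplicaChooser {
    public static String choose(List<String> locations,
                                Set<String> cachedLocations,
                                Set<String> deadNodes) {
        String fallback = null;
        for (String loc : locations) {
            if (deadNodes.contains(loc)) {
                continue;                // skip locally dead nodes
            }
            if (cachedLocations.contains(loc)) {
                return loc;              // cached replica wins
            }
            if (fallback == null) {
                fallback = loc;
            }
        }
        return fallback;                 // may be null if all nodes are dead
    }
}
```

The point of the proposal is that LocatedBlock already carries the cached-location list, so no NameNode change is required for this preference.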
[jira] [Comment Edited] (HDFS-14283) DFSInputStream to prefer cached replica
[ https://issues.apache.org/jira/browse/HDFS-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972968#comment-16972968 ] Lisheng Sun edited comment on HDFS-14283 at 11/13/19 3:05 AM: -- Thanks for [~smeng] [~weichiu] [~ayushtkn] for good suggestions. i updated the patch and uploaded the v005 patch. Could you mind review it? Thank you a lot. [~weichiu][~ayushtkn] [~smeng] was (Author: leosun08): Thanks for [~smeng] [~weichiu] for good suggestions. i updated the patch and uploaded the v005 patch. Could you mind review it? Thank you a lot. [~weichiu] [~smeng] > DFSInputStream to prefer cached replica > --- > > Key: HDFS-14283 > URL: https://issues.apache.org/jira/browse/HDFS-14283 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.6.0 > Environment: HDFS Caching >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14283.001.patch, HDFS-14283.002.patch, > HDFS-14283.003.patch, HDFS-14283.004.patch, HDFS-14283.005.patch > > > HDFS Caching offers performance benefits. However, currently NameNode does > not treat cached replica with higher priority, so HDFS caching is only useful > when cache replication = 3, that is to say, all replicas are cached in > memory, so that a client doesn't randomly pick an uncached replica. > HDFS-6846 proposed to let NameNode give higher priority to cached replica. > Changing a logic in NameNode is always tricky so that didn't get much > traction. Here I propose a different approach: let client (DFSInputStream) > prefer cached replica. > A {{LocatedBlock}} object already contains cached replica location so a > client has the needed information. I think we can change > {{DFSInputStream#getBestNodeDNAddrPair()}} for this purpose. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14283) DFSInputStream to prefer cached replica
[ https://issues.apache.org/jira/browse/HDFS-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972968#comment-16972968 ] Lisheng Sun commented on HDFS-14283: Thanks for [~smeng] [~weichiu] for good suggestions. i updated the patch and uploaded the v005 patch. Could you mind review it? Thank you a lot. [~weichiu] [~smeng] > DFSInputStream to prefer cached replica > --- > > Key: HDFS-14283 > URL: https://issues.apache.org/jira/browse/HDFS-14283 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.6.0 > Environment: HDFS Caching >Reporter: Wei-Chiu Chuang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14283.001.patch, HDFS-14283.002.patch, > HDFS-14283.003.patch, HDFS-14283.004.patch, HDFS-14283.005.patch > > > HDFS Caching offers performance benefits. However, currently NameNode does > not treat cached replica with higher priority, so HDFS caching is only useful > when cache replication = 3, that is to say, all replicas are cached in > memory, so that a client doesn't randomly pick an uncached replica. > HDFS-6846 proposed to let NameNode give higher priority to cached replica. > Changing a logic in NameNode is always tricky so that didn't get much > traction. Here I propose a different approach: let client (DFSInputStream) > prefer cached replica. > A {{LocatedBlock}} object already contains cached replica location so a > client has the needed information. I think we can change > {{DFSInputStream#getBestNodeDNAddrPair()}} for this purpose. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972980#comment-16972980 ] Yiqun Lin commented on HDFS-14648: -- The latest patch looks great; some more comments:
*ClientContext.java*
We need a method to stop the dead node detector thread and call it in DFSClient#close.
{code:java}
/**
 * Close dead node detector thread.
 */
public void stopDeadNodeDetectorThread() {
  if (deadNodeDetectorThr != null) {
    deadNodeDetectorThr.interrupt();
    try {
      deadNodeDetectorThr.join(3000);
    } catch (InterruptedException e) {
      LOG.warn("Encountered exception while waiting to join on dead node detector thread.", e);
    }
  }
}

public synchronized void close() throws IOException {
  if (clientRunning) {
    ...
    // close dead node detector thread
    clientContext.stopDeadNodeDetectorThread();
  }
}
{code}
*DFSInputStream.java*
I haven't seen the call {{dfsClient.addNodeToDeadNodeDetector}} added in method {{createBlockReader}} under this class.
*DFSStripedInputStream.java*
Can we remove dfsClient.addNodeToDeadNodeDetector in this class? It's not expected to enable dead node detection in EC mode.
{code:java}
fetchBlockAt(block.getStartOffset());
- addToDeadNodes(dnInfo.info);
+ addToLocalDeadNodes(dnInfo.info);
+ dfsClient.addNodeToDeadNodeDetector(this, dnInfo.info); <=== to be removed
}
{code}
Can we also fix this whitespace warning?
{noformat}
./hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDeadNodeDetection.java:113: public void testDeadNodeDetectionInMultipleDFSInputStream()
{noformat}
Everything else looks good to me now.
> DeadNodeDetector basic model > > > Key: HDFS-14648 > URL: https://issues.apache.org/jira/browse/HDFS-14648 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch, > HDFS-14648.003.patch, HDFS-14648.004.patch, HDFS-14648.005.patch, > HDFS-14648.006.patch, HDFS-14648.007.patch, HDFS-14648.008.patch, > HDFS-14648.009.patch, HDFS-14648.010.patch > > > This Jira constructs DeadNodeDetector state machine model. The function it > implements as follow: > # When a DFSInputstream is opened, a BlockReader is opened. If some DataNode > of the block is found to inaccessible, put the DataNode into > DeadNodeDetector#deadnode.(HDFS-14649) will optimize this part. Because when > DataNode is not accessible, it is likely that the replica has been removed > from the DataNode.Therefore, it needs to be confirmed by re-probing and > requires a higher priority processing. > # DeadNodeDetector will periodically detect the Node in > DeadNodeDetector#deadnode, If the access is successful, the Node will be > moved from DeadNodeDetector#deadnode. Continuous detection of the dead node > is necessary. The DataNode need rejoin the cluster due to a service > restart/machine repair. The DataNode may be permanently excluded if there is > no added probe mechanism. > # DeadNodeDetector#dfsInputStreamNodes Record the DFSInputstream using > DataNode. When the DFSInputstream is closed, it will be moved from > DeadNodeDetector#dfsInputStreamNodes. > # Every time get the global deanode, update the DeadNodeDetector#deadnode. > The new DeadNodeDetector#deadnode Equals to the intersection of the old > DeadNodeDetector#deadnode and the Datanodes are by > DeadNodeDetector#dfsInputStreamNodes. > # DeadNodeDetector has a switch that is turned off by default. When it is > closed, each DFSInputstream still uses its own local deadnode. 
> # This feature has been used in the XIAOMI production environment for a long time. It reduced HBase read stalls caused by node hangs.
> # Just turn on the DeadNodeDetector switch and you can use it directly; there are no other restrictions. If you don't want to use DeadNodeDetector, just turn it off.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
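Step 4 of the model above amounts to a set intersection: the new global deadnode set keeps only nodes still referenced by some open DFSInputStream. A minimal sketch, assuming String node IDs for simplicity (the method and class names are illustrative, not the actual DeadNodeDetector fields):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of step 4: the new deadnode set is the intersection
// of the old deadnode set and the DataNodes currently referenced by open
// input streams. Not actual DeadNodeDetector code.
public class DeadNodeIntersection {

    static Set<String> updateDeadNodes(Set<String> oldDeadNodes,
                                       Set<String> referencedNodes) {
        // copy so the caller's set is not mutated, then intersect
        Set<String> updated = new HashSet<>(oldDeadNodes);
        updated.retainAll(referencedNodes);
        return updated;
    }

    public static void main(String[] args) {
        Set<String> dead = new HashSet<>(Set.of("dn1", "dn2", "dn3"));
        Set<String> referenced = new HashSet<>(Set.of("dn2", "dn3", "dn4"));
        // dn1 is no longer referenced by any stream, so it drops out
        System.out.println(updateDeadNodes(dead, referenced));
    }
}
```

The effect is that a dead node whose streams have all been closed stops being tracked globally, which keeps the shared set from growing without bound.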
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972990#comment-16972990 ] Lisheng Sun commented on HDFS-14648: Hi [~linyiqun]
{quote}
DFSInputStream.java
I haven't seen the call dfsClient.addNodeToDeadNodeDetector added in method createBlockReader under this class.
{quote}
I do not find the method DFSInputStream#createBlockReader; createBlockReader should be in DFSStripedInputStream.
[jira] [Comment Edited] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972990#comment-16972990 ] Lisheng Sun edited comment on HDFS-14648 at 11/13/19 3:52 AM: -- Hi [~linyiqun]
{quote}
DFSInputStream.java
I haven't seen the call dfsClient.addNodeToDeadNodeDetector added in method createBlockReader under this class.
{quote}
I do not find the method DFSInputStream#createBlockReader; createBlockReader should be in DFSStripedInputStream. Please correct me if I am wrong. Thank you.
[jira] [Commented] (HDFS-14648) DeadNodeDetector basic model
[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972994#comment-16972994 ] Yiqun Lin commented on HDFS-14648: -- [~leosun08], sorry for the confusion, you are right. Please remove this change in DFSStripedInputStream and address the other comments. Thanks.