[jira] [Commented] (HDFS-15756) RBF: Cannot get updated delegation token from zookeeper
[ https://issues.apache.org/jira/browse/HDFS-15756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317683#comment-17317683 ] Fengnan Li commented on HDFS-15756:
---
This was discussed in [HDFS-14405|https://issues.apache.org/jira/browse/HDFS-14405]. And yes, a different store that presents a strongly consistent view to clients can solve the issue.

> RBF: Cannot get updated delegation token from zookeeper
> ---
>
> Key: HDFS-15756
> URL: https://issues.apache.org/jira/browse/HDFS-15756
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf
> Affects Versions: 3.0.0
> Reporter: hbprotoss
> Priority: Major
>
> Affected version: all versions with RBF.
> When RBF works with the Spark 2.4 client in client mode, there is a chance that a token goes missing across different nodes in the RBF cluster. The root cause is that Spark renews the token (via the resource manager) immediately after obtaining it; since ZooKeeper does not guarantee strong consistency after an update across the cluster, a ZooKeeper client may read a stale value from a follower that has not yet synced with the other nodes.
>
> We applied a patch in Spark, but it is still a problem for RBF. Is it possible for RBF to replace the delegation token store with some other datasource (Redis, for example)?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
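The stale-read failure mode described in HDFS-15756 can be sketched with a toy model: writes land on the ZooKeeper leader and replicate to followers asynchronously, so a read served by a lagging follower misses the just-renewed token. The `Ensemble` class and its `sync()` method below are illustrative stand-ins (loosely mirroring ZooKeeper's sync-before-read idiom), not the real ZooKeeper client API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a ZooKeeper-style ensemble: writes go to the leader and
// replicate asynchronously, so a follower read can observe a stale view.
// All names here are hypothetical, for illustration only.
public class StaleReadSketch {
    static class Ensemble {
        final Map<String, String> leader = new HashMap<>();
        final Map<String, String> follower = new HashMap<>();

        // Write is acknowledged by the leader before followers catch up.
        void write(String key, String value) { leader.put(key, value); }

        String readFromFollower(String key) { return follower.get(key); }

        // Stand-in for ZooKeeper's sync(): force the follower to catch up
        // with the leader before the next read.
        void sync() { follower.putAll(leader); }
    }

    public static String readWithoutSync() {
        Ensemble zk = new Ensemble();
        zk.write("/token/seq-1", "renewed");
        return zk.readFromFollower("/token/seq-1"); // stale: write not replicated yet
    }

    public static String readWithSync() {
        Ensemble zk = new Ensemble();
        zk.write("/token/seq-1", "renewed");
        zk.sync();                                  // catch up before reading
        return zk.readFromFollower("/token/seq-1");
    }

    public static void main(String[] args) {
        System.out.println("without sync: " + readWithoutSync());
        System.out.println("with sync:    " + readWithSync());
    }
}
```

A strongly consistent store (as the comment suggests) removes the need for the sync step entirely, because every read reflects the latest acknowledged write.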
[jira] [Work logged] (HDFS-15423) RBF: WebHDFS create shouldn't choose DN from all sub-clusters
[ https://issues.apache.org/jira/browse/HDFS-15423?focusedWorklogId=579770&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579770 ] ASF GitHub Bot logged work on HDFS-15423:
-
Author: ASF GitHub Bot
Created on: 09/Apr/21 05:38
Start Date: 09/Apr/21 05:38
Worklog Time Spent: 10m
Work Description: fengnanli commented on pull request #2605: URL: https://github.com/apache/hadoop/pull/2605#issuecomment-816422020

@goiri Can we land this one? Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 579770)
Time Spent: 5h 50m (was: 5h 40m)

> RBF: WebHDFS create shouldn't choose DN from all sub-clusters
> -
>
> Key: HDFS-15423
> URL: https://issues.apache.org/jira/browse/HDFS-15423
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf, webhdfs
> Reporter: Chao Sun
> Assignee: Fengnan Li
> Priority: Major
> Labels: pull-request-available
> Time Spent: 5h 50m
> Remaining Estimate: 0h
>
> In {{RouterWebHdfsMethods}}, for a {{CREATE}} call, {{chooseDatanode}} first gets all DNs via {{getDatanodeReport}} and then randomly picks one from the list via {{getRandomDatanode}}. This logic doesn't seem correct, as it should pick a DN from the specific sub-cluster(s) of the input {{path}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
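The fix's intent, as described in the issue, can be sketched as: resolve the path against the mount table first, then draw the random datanode only from the matching sub-cluster rather than from a report covering every sub-cluster. All names below are hypothetical, not the actual `RouterWebHdfsMethods` code.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch: longest mount-point prefix match decides the
// sub-cluster, and the random pick is restricted to that sub-cluster's DNs.
public class ChooseDatanodeSketch {
    public static String chooseDatanode(
            Map<String, List<String>> dnsBySubcluster,
            Map<String, String> mountTable,   // mount point -> sub-cluster id
            String path, Random rng) {
        String best = null;
        for (String mount : mountTable.keySet()) {
            // match on path-component boundaries to avoid "/ab" matching "/a"
            boolean matches = path.equals(mount) || path.startsWith(mount + "/");
            if (matches && (best == null || mount.length() > best.length())) {
                best = mount;
            }
        }
        if (best == null) return null;        // path not under any mount point
        List<String> dns = dnsBySubcluster.get(mountTable.get(best));
        return dns.get(rng.nextInt(dns.size()));
    }
}
```

Contrast with the reported bug: the pre-fix logic would have drawn from the union of all sub-clusters' datanodes, so a `CREATE` could land on a DN belonging to the wrong namespace.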
[jira] [Commented] (HDFS-15887) Make LogRoll and TailEdits execute in parallel
[ https://issues.apache.org/jira/browse/HDFS-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317645#comment-17317645 ] Wei-Chiu Chuang commented on HDFS-15887: Not an expert here, but makes sense to me. > Make LogRoll and TailEdits execute in parallel > -- > > Key: HDFS-15887 > URL: https://issues.apache.org/jira/browse/HDFS-15887 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Attachments: edit_files.jpg > > Time Spent: 20m > Remaining Estimate: 0h > > In the EditLogTailer class, LogRoll and TailEdits are executed in a thread, > and when a checkpoint occurs, it will compete with TailEdits for lock > (FSNamesystem#cpLock). > Usually, it takes a long time to execute checkpoint, which will cause the > size of the generated edit log file to be relatively large. > For example, here is an actual effect: > The StandbyCheckpointer log is triggered as follows : edit_files.jpg > 2021-03-11 09:18:42,513 [769071096]-INFO [Standby State > Checkpointer:StandbyCheckpointer$CheckpointerThread@335]-Triggering > checkpoint because there have been 5142154 txns since the last checkpoint, > which exceeds the configured threshold 100 > When loading an edit log with a large amount of data, the processing time > will be longer. We should make the edit log size as even as possible, which > is good for the operation of the system. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
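The contention described in HDFS-15887 can be made concrete with a small sketch: when log rolling runs on its own thread, a checkpoint holding the lock no longer stalls it, whereas a single combined tailer thread would block. The lock and class names below are illustrative stand-ins, not Hadoop's `EditLogTailer` internals.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: the "checkpoint" holds a lock (standing in for
// FSNamesystem#cpLock) that the edit tailer would need, but a roller on
// its own thread does not need it and completes anyway.
public class ParallelTailerSketch {
    public static boolean rollWhileCheckpointHoldsLock() {
        ReentrantLock cpLock = new ReentrantLock();
        CountDownLatch rolled = new CountDownLatch(1);
        cpLock.lock();                            // checkpoint in progress
        try {
            // A dedicated roll thread needs no cpLock, so it finishes even
            // while the checkpoint runs; a combined tailer thread would be
            // stuck waiting for the lock here, letting edit logs grow large.
            Thread roller = new Thread(rolled::countDown);
            roller.start();
            return rolled.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            return false;
        } finally {
            cpLock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("rolled during checkpoint: " + rollWhileCheckpointHoldsLock());
    }
}
```

This is the essence of the proposal: decoupling the two duties keeps edit log segments evenly sized even when checkpoints are slow.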
[jira] [Updated] (HDFS-15243) Add an option to prevent sub-directories of protected directories from deletion
[ https://issues.apache.org/jira/browse/HDFS-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15243:
---
Fix Version/s: 3.3.1

> Add an option to prevent sub-directories of protected directories from deletion
> ---
>
> Key: HDFS-15243
> URL: https://issues.apache.org/jira/browse/HDFS-15243
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: 3.1.1
> Affects Versions: 3.1.1
> Reporter: liuyanyu
> Assignee: liuyanyu
> Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15243.001.patch, HDFS-15243.002.patch, HDFS-15243.003.patch, HDFS-15243.004.patch, HDFS-15243.005.patch, HDFS-15243.006.patch, image-2020-03-28-09-23-31-335.png
>
> HDFS-8983 added fs.protected.directories to support protected directories on the NameNode. But as I tested, when a parent directory (e.g. /testA) is set as a protected directory, a child directory (e.g. /testA/testB) can still be deleted or renamed. Since we protect a directory mainly to protect the data under it, I think a child directory should not be deleted or renamed if its parent is a protected directory.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
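The semantics proposed in HDFS-15243 can be sketched as a walk up the path's ancestors: a delete or rename is rejected if the path itself or any ancestor is in the protected set. This is an illustrative helper under that assumption, not Hadoop's actual `fs.protected.directories` implementation.

```java
import java.util.Set;

// Hypothetical check: a path is protected if it, or any ancestor directory,
// appears in the protected set. Walking up the parents is what extends the
// original behavior (which checked only the directory itself).
public class ProtectedDirSketch {
    public static boolean isProtected(String path, Set<String> protectedDirs) {
        String dir = path;
        while (dir != null) {
            if (protectedDirs.contains(dir)) {
                return true;
            }
            int slash = dir.lastIndexOf('/');
            if (slash > 0) {
                dir = dir.substring(0, slash);    // step up to the parent
            } else {
                dir = dir.equals("/") ? null : "/"; // finally check the root
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> protectedDirs = Set.of("/testA");
        System.out.println(isProtected("/testA/testB", protectedDirs)); // child blocked
        System.out.println(isProtected("/testB", protectedDirs));       // unrelated path allowed
    }
}
```

With this rule, deleting `/testA/testB` is rejected once `/testA` is protected, which is exactly the gap the reporter observed.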
[jira] [Commented] (HDFS-15887) Make LogRoll and TailEdits execute in parallel
[ https://issues.apache.org/jira/browse/HDFS-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317614#comment-17317614 ] JiangHua Zhu commented on HDFS-15887: - [~weichiu] [~hexiaoqiao], I submitted some code. Can you give me a review? Thank you very much. > Make LogRoll and TailEdits execute in parallel > -- > > Key: HDFS-15887 > URL: https://issues.apache.org/jira/browse/HDFS-15887 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Attachments: edit_files.jpg > > Time Spent: 20m > Remaining Estimate: 0h > > In the EditLogTailer class, LogRoll and TailEdits are executed in a thread, > and when a checkpoint occurs, it will compete with TailEdits for lock > (FSNamesystem#cpLock). > Usually, it takes a long time to execute checkpoint, which will cause the > size of the generated edit log file to be relatively large. > For example, here is an actual effect: > The StandbyCheckpointer log is triggered as follows : edit_files.jpg > 2021-03-11 09:18:42,513 [769071096]-INFO [Standby State > Checkpointer:StandbyCheckpointer$CheckpointerThread@335]-Triggering > checkpoint because there have been 5142154 txns since the last checkpoint, > which exceeds the configured threshold 100 > When loading an edit log with a large amount of data, the processing time > will be longer. We should make the edit log size as even as possible, which > is good for the operation of the system. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15960) Router NamenodeHeartbeatService fails to authenticate with namenode in a kerberized envi
[ https://issues.apache.org/jira/browse/HDFS-15960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Borislav Iordanov updated HDFS-15960:
---
Description: We use hadoop.http.authentication.type = "kerberos", and when the NamenodeHeartbeatService calls the namenode via JMX it does not provide a user security context, so the authentication token is not transmitted and the call fails.

> Router NamenodeHeartbeatService fails to authenticate with namenode in a kerberized envi
>
> Key: HDFS-15960
> URL: https://issues.apache.org/jira/browse/HDFS-15960
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Borislav Iordanov
> Priority: Major
>
> We use hadoop.http.authentication.type = "kerberos", and when the NamenodeHeartbeatService calls the namenode via JMX it does not provide a user security context, so the authentication token is not transmitted and the call fails.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15960) Router NamenodeHeartbeatService fails to authenticate with namenode in a kerberized envi
Borislav Iordanov created HDFS-15960: Summary: Router NamenodeHeartbeatService fails to authenticate with namenode in a kerberized envi Key: HDFS-15960 URL: https://issues.apache.org/jira/browse/HDFS-15960 Project: Hadoop HDFS Issue Type: Bug Reporter: Borislav Iordanov -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing
[ https://issues.apache.org/jira/browse/HDFS-15940?focusedWorklogId=579694=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579694 ] ASF GitHub Bot logged work on HDFS-15940: - Author: ASF GitHub Bot Created on: 09/Apr/21 02:21 Start Date: 09/Apr/21 02:21 Worklog Time Spent: 10m Work Description: tasanuma commented on pull request #2874: URL: https://github.com/apache/hadoop/pull/2874#issuecomment-816354020 Merged to trunk and cherry-picked to branch-3.3. Thanks for your PR, @virajjasani. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579694) Time Spent: 5h 20m (was: 5h 10m) > Some tests in TestBlockRecovery are consistently failing > > > Key: HDFS-15940 > URL: https://issues.apache.org/jira/browse/HDFS-15940 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0 > > Time Spent: 5h 20m > Remaining Estimate: 0h > > Some long running tests in TestBlockRecovery are consistently failing. Also, > TestBlockRecovery is huge with so many tests, we should refactor some of long > running and race condition specific tests to separate class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing
[ https://issues.apache.org/jira/browse/HDFS-15940?focusedWorklogId=579692=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579692 ] ASF GitHub Bot logged work on HDFS-15940: - Author: ASF GitHub Bot Created on: 09/Apr/21 02:10 Start Date: 09/Apr/21 02:10 Worklog Time Spent: 10m Work Description: tasanuma merged pull request #2874: URL: https://github.com/apache/hadoop/pull/2874 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579692) Time Spent: 5h 10m (was: 5h) > Some tests in TestBlockRecovery are consistently failing > > > Key: HDFS-15940 > URL: https://issues.apache.org/jira/browse/HDFS-15940 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0 > > Time Spent: 5h 10m > Remaining Estimate: 0h > > Some long running tests in TestBlockRecovery are consistently failing. Also, > TestBlockRecovery is huge with so many tests, we should refactor some of long > running and race condition specific tests to separate class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15958) TestBPOfferService.testMissBlocksWhenReregister is flaky
[ https://issues.apache.org/jira/browse/HDFS-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Borislav Iordanov updated HDFS-15958: - Priority: Minor (was: Major) > TestBPOfferService.testMissBlocksWhenReregister is flaky > > > Key: HDFS-15958 > URL: https://issues.apache.org/jira/browse/HDFS-15958 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Borislav Iordanov >Priority: Minor > > This test fails relatively frequently due to a race condition. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15959) Add support to digest based authentication in ZKDelegationTokenSecretManager
Borislav Iordanov created HDFS-15959: Summary: Add support to digest based authentication in ZKDelegationTokenSecretManager Key: HDFS-15959 URL: https://issues.apache.org/jira/browse/HDFS-15959 Project: Hadoop HDFS Issue Type: Improvement Reporter: Borislav Iordanov -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15621) Datanode DirectoryScanner uses excessive memory
[ https://issues.apache.org/jira/browse/HDFS-15621?focusedWorklogId=579626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579626 ] ASF GitHub Bot logged work on HDFS-15621:
-
Author: ASF GitHub Bot
Created on: 08/Apr/21 22:52
Start Date: 08/Apr/21 22:52
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on pull request #2849: URL: https://github.com/apache/hadoop/pull/2849#issuecomment-816282010

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 51s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 35m 4s | | trunk passed |
| +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | compile | 1m 12s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 1s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 21s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 51s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 1m 23s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 20s | | trunk passed |
| +1 :green_heart: | shadedclient | 18m 44s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 14s | | the patch passed |
| +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javac | 1m 15s | | the patch passed |
| +1 :green_heart: | compile | 1m 6s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | javac | 1m 6s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 55s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 14s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 45s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 1m 15s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 19s | | the patch passed |
| +1 :green_heart: | shadedclient | 18m 59s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 349m 5s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2849/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 36s | | The patch does not generate ASF License warnings. |
| | | | 442m 11s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
| | hadoop.hdfs.server.balancer.TestBalancer |
| | hadoop.hdfs.server.datanode.TestBlockScanner |
| | hadoop.hdfs.TestRollingUpgrade |
| | hadoop.hdfs.server.namenode.TestFileTruncate |
| | hadoop.hdfs.TestViewDistributedFileSystemWithMountLinks |
| | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
| | hadoop.hdfs.TestPersistBlocks |
| | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
| | hadoop.hdfs.TestDFSShell |
| | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
| | hadoop.hdfs.server.datanode.TestIncrementalBrVariations |
| | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList |
| | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
| | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
| | hadoop.hdfs.server.namenode.TestDecommissioningStatus |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2849/2/artifact/out/Dockerfile |
| GITHUB PR |
[jira] [Created] (HDFS-15958) TestBPOfferService.testMissBlocksWhenReregister is flaky
Borislav Iordanov created HDFS-15958: Summary: TestBPOfferService.testMissBlocksWhenReregister is flaky Key: HDFS-15958 URL: https://issues.apache.org/jira/browse/HDFS-15958 Project: Hadoop HDFS Issue Type: Bug Reporter: Borislav Iordanov This test fails relatively frequently due to a race condition. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang
[ https://issues.apache.org/jira/browse/HDFS-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-15957: -- Labels: pull-request-available (was: ) > The ignored IOException in the RPC response sent by FSEditLogAsync can cause > the HDFS client to hang > > > Key: HDFS-15957 > URL: https://issues.apache.org/jira/browse/HDFS-15957 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Priority: Critical > Labels: pull-request-available > Attachments: fsshell.txt, namenode.txt, reproduce.patch, > secondnamenode.txt > > Time Spent: 10m > Remaining Estimate: 0h > > In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, > because the possible exception (e.g., IOException) thrown in line 365 is > always ignored. > > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > class FSEditLogAsync extends FSEditLog implements Runnable { > // ... > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); > if (edit != null) { > // sync if requested by edit log. > doSync = edit.logEdit(); > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. 
> RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx); // line > 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > // the calling rpc thread will return immediately from logSync but the > // rpc response will not be sent until the edit is durable. > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { > call.sendResponse();// line > 365 > } else { > call.abortResponse(syncEx); > } > } catch (Exception e) {} // don't care if not sent. > } > } > } > {code} > The `call.sendResponse()` may throw an IOException. According to the > comment (“don’t care if not sent”) there, this exception is neither handled > nor printed in log. However, we suspect that some RPC responses sent there > may be critical, and there should be some retry mechanism. > We try to introduce a single IOException in line 365, and find that the > HDFS client (e.g., `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`) may get > stuck forever (hang for >30min without any log). We can reproduce this > symptom in multiple ways. One of the simplest ways of reproduction is shown > as follows: > # Start a new empty HDFS cluster (1 namenode, 2 datanodes) with the default > configuration. > # Generate a file of 15MB for testing, by `fallocate -l 1500 foo.txt`. > # Run the HDFS client `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`. > # When line 365 is invoked the third time (it is invoked 6 times in total in > this experiment), inject an IOException there. (A patch for injecting the > exception this way is attached to reproduce the issue) > Then the client hangs forever, without any log. 
If we run `bin/hdfs dfs > -ls /` to check the file status, we can not see the expected 15MB `/1.txt` > file. > The jstack of the HDFS client shows that there is an RPC call infinitely > waiting. > {code:java} > "Thread-6" #18 daemon prio=5 os_prio=0 tid=0x7f9cd5295800 nid=0x26b9 in > Object.wait() [0x7f9ca354f000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00071e709610> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:502) > at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1556) > - locked <0x00071e709610> (a org.apache.hadoop.ipc.Client$Call) > at
[jira] [Work logged] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang
[ https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=579611=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579611 ] ASF GitHub Bot logged work on HDFS-15957: - Author: ASF GitHub Bot Created on: 08/Apr/21 22:20 Start Date: 08/Apr/21 22:20 Worklog Time Spent: 10m Work Description: functioner opened a new pull request #2878: URL: https://github.com/apache/hadoop/pull/2878 I propose a fix for [HDFS-15957](https://issues.apache.org/jira/browse/HDFS-15957). And probably we should make `RESPONSE_SEND_RETRIES` configurable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579611) Remaining Estimate: 0h Time Spent: 10m > The ignored IOException in the RPC response sent by FSEditLogAsync can cause > the HDFS client to hang > > > Key: HDFS-15957 > URL: https://issues.apache.org/jira/browse/HDFS-15957 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs async, namenode >Affects Versions: 3.2.2 >Reporter: Haoze Wu >Priority: Critical > Attachments: fsshell.txt, namenode.txt, reproduce.patch, > secondnamenode.txt > > Time Spent: 10m > Remaining Estimate: 0h > > In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, > because the possible exception (e.g., IOException) thrown in line 365 is > always ignored. > > {code:java} > //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java > class FSEditLogAsync extends FSEditLog implements Runnable { > // ... > @Override > public void run() { > try { > while (true) { > boolean doSync; > Edit edit = dequeueEdit(); > if (edit != null) { > // sync if requested by edit log. 
> doSync = edit.logEdit(); > syncWaitQ.add(edit); > } else { > // sync when editq runs dry, but have edits pending a sync. > doSync = !syncWaitQ.isEmpty(); > } > if (doSync) { > // normally edit log exceptions cause the NN to terminate, but tests > // relying on ExitUtil.terminate need to see the exception. > RuntimeException syncEx = null; > try { > logSync(getLastWrittenTxId()); > } catch (RuntimeException ex) { > syncEx = ex; > } > while ((edit = syncWaitQ.poll()) != null) { > edit.logSyncNotify(syncEx); // line > 248 > } > } > } > } catch (InterruptedException ie) { > LOG.info(Thread.currentThread().getName() + " was interrupted, > exiting"); > } catch (Throwable t) { > terminate(t); > } > } > // the calling rpc thread will return immediately from logSync but the > // rpc response will not be sent until the edit is durable. > private static class RpcEdit extends Edit { > // ... > @Override > public void logSyncNotify(RuntimeException syncEx) { > try { > if (syncEx == null) { > call.sendResponse();// line > 365 > } else { > call.abortResponse(syncEx); > } > } catch (Exception e) {} // don't care if not sent. > } > } > } > {code} > The `call.sendResponse()` may throw an IOException. According to the > comment (“don’t care if not sent”) there, this exception is neither handled > nor printed in log. However, we suspect that some RPC responses sent there > may be critical, and there should be some retry mechanism. > We try to introduce a single IOException in line 365, and find that the > HDFS client (e.g., `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`) may get > stuck forever (hang for >30min without any log). We can reproduce this > symptom in multiple ways. One of the simplest ways of reproduction is shown > as follows: > # Start a new empty HDFS cluster (1 namenode, 2 datanodes) with the default > configuration. > # Generate a file of 15MB for testing, by `fallocate -l 1500 foo.txt`. > # Run the HDFS client `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`. 
> # When line 365 is invoked the third time (it is invoked 6 times in total in > this experiment), inject an IOException there. (A patch for injecting the > exception this way is attached to reproduce the issue) > Then the client hangs forever, without any log. If we run `bin/hdfs
[jira] [Created] (HDFS-15957) The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang
Haoze Wu created HDFS-15957: --- Summary: The ignored IOException in the RPC response sent by FSEditLogAsync can cause the HDFS client to hang Key: HDFS-15957 URL: https://issues.apache.org/jira/browse/HDFS-15957 Project: Hadoop HDFS Issue Type: Bug Components: fs async, namenode Affects Versions: 3.2.2 Reporter: Haoze Wu Attachments: fsshell.txt, namenode.txt, reproduce.patch, secondnamenode.txt In FSEditLogAsync, the RpcEdit notification in line 248 could be skipped, because the possible exception (e.g., IOException) thrown in line 365 is always ignored. {code:java} //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java class FSEditLogAsync extends FSEditLog implements Runnable { // ... @Override public void run() { try { while (true) { boolean doSync; Edit edit = dequeueEdit(); if (edit != null) { // sync if requested by edit log. doSync = edit.logEdit(); syncWaitQ.add(edit); } else { // sync when editq runs dry, but have edits pending a sync. doSync = !syncWaitQ.isEmpty(); } if (doSync) { // normally edit log exceptions cause the NN to terminate, but tests // relying on ExitUtil.terminate need to see the exception. RuntimeException syncEx = null; try { logSync(getLastWrittenTxId()); } catch (RuntimeException ex) { syncEx = ex; } while ((edit = syncWaitQ.poll()) != null) { edit.logSyncNotify(syncEx); // line 248 } } } } catch (InterruptedException ie) { LOG.info(Thread.currentThread().getName() + " was interrupted, exiting"); } catch (Throwable t) { terminate(t); } } // the calling rpc thread will return immediately from logSync but the // rpc response will not be sent until the edit is durable. private static class RpcEdit extends Edit { // ... @Override public void logSyncNotify(RuntimeException syncEx) { try { if (syncEx == null) { call.sendResponse();// line 365 } else { call.abortResponse(syncEx); } } catch (Exception e) {} // don't care if not sent. 
} } } {code} The `call.sendResponse()` may throw an IOException. According to the comment (“don’t care if not sent”) there, this exception is neither handled nor printed in log. However, we suspect that some RPC responses sent there may be critical, and there should be some retry mechanism. We try to introduce a single IOException in line 365, and find that the HDFS client (e.g., `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`) may get stuck forever (hang for >30min without any log). We can reproduce this symptom in multiple ways. One of the simplest ways of reproduction is shown as follows: # Start a new empty HDFS cluster (1 namenode, 2 datanodes) with the default configuration. # Generate a file of 15MB for testing, by `fallocate -l 1500 foo.txt`. # Run the HDFS client `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`. # When line 365 is invoked the third time (it is invoked 6 times in total in this experiment), inject an IOException there. (A patch for injecting the exception this way is attached to reproduce the issue) Then the client hangs forever, without any log. If we run `bin/hdfs dfs -ls /` to check the file status, we can not see the expected 15MB `/1.txt` file. The jstack of the HDFS client shows that there is an RPC call infinitely waiting. 
{code:java} "Thread-6" #18 daemon prio=5 os_prio=0 tid=0x7f9cd5295800 nid=0x26b9 in Object.wait() [0x7f9ca354f000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00071e709610> (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:502) at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1556) - locked <0x00071e709610> (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1513) at org.apache.hadoop.ipc.Client.call(Client.java:1410) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy9.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:520) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
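One shape the fix for HDFS-15957 could take, per the PR's suggestion of a configurable retry count, is a bounded retry around `sendResponse()` instead of the silent `catch (Exception e) {}`. The `Call` interface and `RESPONSE_SEND_RETRIES` constant below are stand-ins for illustration, not the actual Hadoop IPC types.

```java
import java.io.IOException;

// Sketch of a bounded-retry wrapper: instead of swallowing the IOException
// (which leaves the client waiting forever), retry a few times and report
// whether the response was ultimately sent so the failure can be logged.
public class LogSyncNotifySketch {
    interface Call {                       // stand-in for the RPC call object
        void sendResponse() throws IOException;
    }

    static final int RESPONSE_SEND_RETRIES = 3;   // hypothetical; PR suggests making this configurable

    public static boolean sendWithRetries(Call call) {
        for (int attempt = 1; attempt <= RESPONSE_SEND_RETRIES; attempt++) {
            try {
                call.sendResponse();
                return true;               // client received its RPC response
            } catch (IOException e) {
                // log and retry rather than silently dropping the response
            }
        }
        return false;                      // give up after bounded retries
    }

    // Demonstration helper: a Call that fails its first `failures` attempts.
    public static boolean sendFailingFirst(int failures) {
        int[] n = {0};
        return sendWithRetries(() -> {
            if (n[0]++ < failures) {
                throw new IOException("injected failure " + n[0]);
            }
        });
    }

    public static void main(String[] args) {
        System.out.println("transient failure recovered: " + sendFailingFirst(2));
        System.out.println("persistent failure reported: " + sendFailingFirst(99));
    }
}
```

Even with retries exhausted, surfacing the failure (logging, or aborting the call) would let the client time out and retry at the RPC layer instead of hanging indefinitely as described above.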
[jira] [Commented] (HDFS-15955) Make explicit_bzero cross platform
[ https://issues.apache.org/jira/browse/HDFS-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317336#comment-17317336 ] Íñigo Goiri commented on HDFS-15955: Thanks [~gautham] for the patch. Merged PR 2875. > Make explicit_bzero cross platform > -- > > Key: HDFS-15955 > URL: https://issues.apache.org/jira/browse/HDFS-15955 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs++ >Affects Versions: 3.4.0 >Reporter: Gautham Banasandra >Assignee: Gautham Banasandra >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The function explicit_bzero isn't available in Visual C++. Need to make this > cross platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-15955) Make explicit_bzero cross platform
[ https://issues.apache.org/jira/browse/HDFS-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri resolved HDFS-15955. Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Make explicit_bzero cross platform > -- > > Key: HDFS-15955 > URL: https://issues.apache.org/jira/browse/HDFS-15955 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs++ >Affects Versions: 3.4.0 >Reporter: Gautham Banasandra >Assignee: Gautham Banasandra >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The function explicit_bzero isn't available in Visual C++. Need to make this > cross platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15955) Make explicit_bzero cross platform
[ https://issues.apache.org/jira/browse/HDFS-15955?focusedWorklogId=579354=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579354 ] ASF GitHub Bot logged work on HDFS-15955: - Author: ASF GitHub Bot Created on: 08/Apr/21 16:44 Start Date: 08/Apr/21 16:44 Worklog Time Spent: 10m Work Description: goiri merged pull request #2875: URL: https://github.com/apache/hadoop/pull/2875 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579354) Time Spent: 50m (was: 40m) > Make explicit_bzero cross platform > -- > > Key: HDFS-15955 > URL: https://issues.apache.org/jira/browse/HDFS-15955 > Project: Hadoop HDFS > Issue Type: Improvement > Components: libhdfs++ >Affects Versions: 3.4.0 >Reporter: Gautham Banasandra >Assignee: Gautham Banasandra >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The function explicit_bzero isn't available in Visual C++. Need to make this > cross platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15956) Provide utility class for FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-15956?focusedWorklogId=579345=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579345 ] ASF GitHub Bot logged work on HDFS-15956: - Author: ASF GitHub Bot Created on: 08/Apr/21 16:19 Start Date: 08/Apr/21 16:19 Worklog Time Spent: 10m Work Description: virajjasani commented on pull request #2876: URL: https://github.com/apache/hadoop/pull/2876#issuecomment-815959265 I understand: some of the most critical and fundamental operations are executed by Namesystem, so refactoring might make it difficult to retain a clean git history; at the same time, though, the class might reach 10k lines of code pretty soon. Perhaps the pros of keeping a clean git blame history and smooth backports outweigh the cons of having ~9-10k lines of code. Let's wait at least 1 day for any further opinions? If nothing else is added, I can close the PR. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579345) Time Spent: 1h 20m (was: 1h 10m) > Provide utility class for FSNamesystem > -- > > Key: HDFS-15956 > URL: https://issues.apache.org/jira/browse/HDFS-15956 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > With ever-growing functionality, FSNamesystem has become very large (with > ~9k lines of code) over time; we should provide a utility class > and refactor as many basic utility functions into the new class as we can. > With any further suggestions, we can create sub-tasks of this Jira and work > on them. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-15916) DistCp: Backward compatibility: Distcp fails from Hadoop 3 to Hadoop 2 for snapshotdiff
[ https://issues.apache.org/jira/browse/HDFS-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena resolved HDFS-15916. - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Committed to trunk. Thanks [~weichiu] and [~vjasani] for the reviews, and [~smajeti] for the report. Cherry-picking has issues due to HADOOP-17482; will wait to see whether that can be backported, or raise a backport PR. > DistCp: Backward compatibility: Distcp fails from Hadoop 3 to Hadoop 2 for > snapshotdiff > --- > > Key: HDFS-15916 > URL: https://issues.apache.org/jira/browse/HDFS-15916 > Project: Hadoop HDFS > Issue Type: Bug > Components: distcp >Affects Versions: 3.2.2 >Reporter: Srinivasu Majeti >Assignee: Ayush Saxena >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like when using the distcp diff options between two snapshots, from a Hadoop > 3 cluster to a Hadoop 2 cluster, we get the exception below, which seems to break > backward compatibility due to the introduction of the new API > getSnapshotDiffReportListing. > > {code:java} > hadoop distcp -diff s1 s2 -update src_cluster_path dst_cluster_path > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException): > Unknown method getSnapshotDiffReportListing called on > org.apache.hadoop.hdfs.protocol.ClientProtocol protocol > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
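The usual fix for this kind of incompatibility is a client-side fallback: try the newer RPC first, and drop back to the legacy one when the remote side rejects the method as unknown. The following is a hedged sketch of that pattern only, not the actual HDFS-15916 patch; `ClientProto`, `UnknownRpcMethodException`, and `snapshotDiff` are simplified stand-ins (in Hadoop, the server-side error surfaces as a `RemoteException` wrapping `RpcNoSuchMethodException`).

```java
import java.util.Arrays;
import java.util.List;

public class SnapshotDiffFallback {
    // Stand-in for the server-side "unknown method" error. The real Hadoop
    // class is org.apache.hadoop.ipc.RpcNoSuchMethodException.
    static class UnknownRpcMethodException extends RuntimeException {}

    // Simplified stand-in for the ClientProtocol proxy.
    interface ClientProto {
        List<String> getSnapshotDiffReportListing() throws UnknownRpcMethodException;
        List<String> getSnapshotDiffReport();
    }

    /**
     * Prefer the newer paginated API; an old (Hadoop 2) NameNode does not
     * implement it, so fall back to the legacy single-shot API.
     */
    static List<String> snapshotDiff(ClientProto proxy) {
        try {
            return proxy.getSnapshotDiffReportListing();
        } catch (UnknownRpcMethodException e) {
            return proxy.getSnapshotDiffReport(); // legacy path
        }
    }

    public static void main(String[] args) {
        // Simulate a Hadoop 2 NameNode that lacks the listing RPC.
        ClientProto hadoop2 = new ClientProto() {
            public List<String> getSnapshotDiffReportListing() {
                throw new UnknownRpcMethodException();
            }
            public List<String> getSnapshotDiffReport() {
                return Arrays.asList("M ./file1", "+ ./file2");
            }
        };
        System.out.println(snapshotDiff(hadoop2)); // legacy API result is returned
    }
}
```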
[jira] [Assigned] (HDFS-15916) DistCp: Backward compatibility: Distcp fails from Hadoop 3 to Hadoop 2 for snapshotdiff
[ https://issues.apache.org/jira/browse/HDFS-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena reassigned HDFS-15916: --- Assignee: Ayush Saxena > DistCp: Backward compatibility: Distcp fails from Hadoop 3 to Hadoop 2 for > snapshotdiff > --- > > Key: HDFS-15916 > URL: https://issues.apache.org/jira/browse/HDFS-15916 > Project: Hadoop HDFS > Issue Type: Bug > Components: distcp >Affects Versions: 3.2.2 >Reporter: Srinivasu Majeti >Assignee: Ayush Saxena >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like when using the distcp diff options between two snapshots, from a Hadoop > 3 cluster to a Hadoop 2 cluster, we get the exception below, which seems to break > backward compatibility due to the introduction of the new API > getSnapshotDiffReportListing. > > {code:java} > hadoop distcp -diff s1 s2 -update src_cluster_path dst_cluster_path > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException): > Unknown method getSnapshotDiffReportListing called on > org.apache.hadoop.hdfs.protocol.ClientProtocol protocol > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15916) DistCp: Backward compatibility: Distcp fails from Hadoop 3 to Hadoop 2 for snapshotdiff
[ https://issues.apache.org/jira/browse/HDFS-15916?focusedWorklogId=579294=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579294 ] ASF GitHub Bot logged work on HDFS-15916: - Author: ASF GitHub Bot Created on: 08/Apr/21 15:19 Start Date: 08/Apr/21 15:19 Worklog Time Spent: 10m Work Description: ayushtkn merged pull request #2863: URL: https://github.com/apache/hadoop/pull/2863 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579294) Time Spent: 0.5h (was: 20m) > DistCp: Backward compatibility: Distcp fails from Hadoop 3 to Hadoop 2 for > snapshotdiff > --- > > Key: HDFS-15916 > URL: https://issues.apache.org/jira/browse/HDFS-15916 > Project: Hadoop HDFS > Issue Type: Bug > Components: distcp >Affects Versions: 3.2.2 >Reporter: Srinivasu Majeti >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It looks like when using the distcp diff options between two snapshots, from a Hadoop > 3 cluster to a Hadoop 2 cluster, we get the exception below, which seems to break > backward compatibility due to the introduction of the new API > getSnapshotDiffReportListing. > > {code:java} > hadoop distcp -diff s1 s2 -update src_cluster_path dst_cluster_path > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException): > Unknown method getSnapshotDiffReportListing called on > org.apache.hadoop.hdfs.protocol.ClientProtocol protocol > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15956) Provide utility class for FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-15956?focusedWorklogId=579285=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579285 ] ASF GitHub Bot logged work on HDFS-15956: - Author: ASF GitHub Bot Created on: 08/Apr/21 15:04 Start Date: 08/Apr/21 15:04 Worklog Time Spent: 10m Work Description: ayushtkn commented on pull request #2876: URL: https://github.com/apache/hadoop/pull/2876#issuecomment-815898530 > we do not refactored this code very much unless necessary for easier history, git blame && backport future changes I agree with this. Backports would mostly be a pain after this change, so in my opinion, if it isn't going to fetch us anything, let it stay as is. But if folks feel we should go ahead with this, no objections from my side, provided we check this carefully, since it touches some critical parts of the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579285) Time Spent: 1h 10m (was: 1h) > Provide utility class for FSNamesystem > -- > > Key: HDFS-15956 > URL: https://issues.apache.org/jira/browse/HDFS-15956 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > With ever-growing functionality, FSNamesystem has become very large (with > ~9k lines of code) over time; we should provide a utility class > and refactor as many basic utility functions into the new class as we can. > With any further suggestions, we can create sub-tasks of this Jira and work > on them. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15942) Increase Quota initialization threads
[ https://issues.apache.org/jira/browse/HDFS-15942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-15942: - Fix Version/s: 3.3.1 > Increase Quota initialization threads > - > > Key: HDFS-15942 > URL: https://issues.apache.org/jira/browse/HDFS-15942 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Fix For: 3.3.1, 3.4.0 > > Attachments: HDFS-15942.001.patch > > > On large namespaces, the quota initialization at startup can take a long time > with the default 4 threads. Also on NN failover, the quota often needs to be > calculated before the failover can complete, delaying the failover. > I performed some benchmarks some time back on a large image (316M inodes, 35GB > on disk); the quota load takes: > {code} > quota - 4 threads 39 seconds > quota - 8 threads 23 seconds > quota - 12 threads 20 seconds > quota - 16 threads 15 seconds > {code} > As the quota is calculated when the NN is starting up (and hence doing no > other work) or at failover time before the new standby becomes active, I > think the quota calculation should use as many threads as possible. > I propose we change the default to 8 or 12 on at least trunk and branch-3.3 > so we have a better default going forward. > Has anyone got any other thoughts? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
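Assuming the thread count discussed above is controlled by the `dfs.namenode.quota.init-threads` property (present in recent Hadoop releases; verify against your version's hdfs-default.xml), raising it ahead of any new default would look like this in hdfs-site.xml:

```xml
<!-- hdfs-site.xml: use more threads for quota initialization at NameNode
     startup/failover. Sketch only; tune the value for your hardware. -->
<property>
  <name>dfs.namenode.quota.init-threads</name>
  <value>12</value>
</property>
```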
[jira] [Resolved] (HDFS-15937) Reduce memory used during datanode layout upgrade
[ https://issues.apache.org/jira/browse/HDFS-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell resolved HDFS-15937. -- Resolution: Fixed > Reduce memory used during datanode layout upgrade > - > > Key: HDFS-15937 > URL: https://issues.apache.org/jira/browse/HDFS-15937 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0, 3.1.4, 3.2.2, 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: heap-dump-after.png, heap-dump-before.png > > Time Spent: 2h > Remaining Estimate: 0h > > When the datanode block layout is upgraded from -56 (256x256) to -57 (32x32), > we have found the datanode uses a lot more memory than usual. > For each volume, the blocks are scanned and a list is created holding a > series of LinkArgs objects. This object contains a File object for the block > source and destination. The File object stores the path as a string, e.g.: > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825 > This string is repeated for every block and meta file on the DN, and much > of the string is the same each time, leading to a large amount of memory use. > If we change the LinkArgs to store: > * Src path without the block, e.g. > /data01/dfs/dn/previous.tmp/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0 > * Dest path without the block, e.g. > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir10 > * Block / meta file name, e.g. blk_12345678_1001 or blk_12345678_1001.meta > Then, by ensuring we reuse the same File object for repeated src and dest paths, > we can save most of the memory without reworking the logic of the code. 
> The current logic works along the source paths recursively, so you can easily > re-use the src path object. > For the destination path, there are only 32x32 (1024) distinct paths, so we > can simply cache them in a HashMap and look up the reusable object each time. > I tested locally by generating 100k block files and attempting the layout > upgrade. A heap dump showed the 100k blocks using about 140MB of heap. That > is close to 1.5GB per 1M blocks. > After the change outlined above, the same 100k blocks used about 20MB of heap, > so 200MB per million blocks. > A general DN sizing recommendation is 1GB of heap per 1M blocks, so the > upgrade should be able to happen within the pre-upgrade heap. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
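The destination-path caching described in this issue can be sketched as follows. This is a simplified, hypothetical illustration rather than the actual HDFS-15937 patch; the class name `ParentDirCache` and the example paths are invented.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified illustration of the caching idea from HDFS-15937 (the class
 * name ParentDirCache is invented): keep one File object per distinct
 * parent directory so repeated paths don't each hold their own copy of
 * the long path string.
 */
public class ParentDirCache {
    private final Map<String, File> dirs = new HashMap<>();

    /** Return a shared File for this parent path, creating it on first use. */
    File get(String parentPath) {
        return dirs.computeIfAbsent(parentPath, File::new);
    }

    public static void main(String[] args) {
        ParentDirCache cache = new ParentDirCache();
        // With a 32x32 layout there are only 1024 distinct destination dirs,
        // so the cache stays small regardless of block count.
        File a = cache.get("/data01/dfs/dn/current/finalized/subdir0/subdir0");
        File b = cache.get("/data01/dfs/dn/current/finalized/subdir0/subdir0");
        System.out.println(a == b); // prints "true": the same object is reused
    }
}
```

Each block's LinkArgs would then hold a shared parent-directory object plus only its own short file name, instead of two full path strings.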
[jira] [Commented] (HDFS-15937) Reduce memory used during datanode layout upgrade
[ https://issues.apache.org/jira/browse/HDFS-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317117#comment-17317117 ] Stephen O'Donnell commented on HDFS-15937: -- Committed this from 3.1 up to trunk. > Reduce memory used during datanode layout upgrade > - > > Key: HDFS-15937 > URL: https://issues.apache.org/jira/browse/HDFS-15937 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0, 3.1.4, 3.2.2, 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: heap-dump-after.png, heap-dump-before.png > > Time Spent: 2h > Remaining Estimate: 0h > > When the datanode block layout is upgraded from -56 (256x256) to -57 (32x32), > we have found the datanode uses a lot more memory than usual. > For each volume, the blocks are scanned and a list is created holding a > series of LinkArgs objects. This object contains a File object for the block > source and destination. The File object stores the path as a string, e.g.: > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825 > This string is repeated for every block and meta file on the DN, and much > of the string is the same each time, leading to a large amount of memory use. 
> If we change the LinkArgs to store: > * Src path without the block, e.g. > /data01/dfs/dn/previous.tmp/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0 > * Dest path without the block, e.g. > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir10 > * Block / meta file name, e.g. blk_12345678_1001 or blk_12345678_1001.meta > Then, by ensuring we reuse the same File object for repeated src and dest paths, > we can save most of the memory without reworking the logic of the code. > The current logic works along the source paths recursively, so you can easily > re-use the src path object. > For the destination path, there are only 32x32 (1024) distinct paths, so we > can simply cache them in a HashMap and look up the reusable object each time. > I tested locally by generating 100k block files and attempting the layout > upgrade. A heap dump showed the 100k blocks using about 140MB of heap. That > is close to 1.5GB per 1M blocks. > After the change outlined above, the same 100k blocks used about 20MB of heap, > so 200MB per million blocks. > A general DN sizing recommendation is 1GB of heap per 1M blocks, so the > upgrade should be able to happen within the pre-upgrade heap. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15937) Reduce memory used during datanode layout upgrade
[ https://issues.apache.org/jira/browse/HDFS-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-15937: - Fix Version/s: 3.2.3 3.1.5 3.4.0 3.3.1 > Reduce memory used during datanode layout upgrade > - > > Key: HDFS-15937 > URL: https://issues.apache.org/jira/browse/HDFS-15937 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0, 3.1.4, 3.2.2, 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: heap-dump-after.png, heap-dump-before.png > > Time Spent: 2h > Remaining Estimate: 0h > > When the datanode block layout is upgraded from -56 (256x256) to -57 (32x32), > we have found the datanode uses a lot more memory than usual. > For each volume, the blocks are scanned and a list is created holding a > series of LinkArgs objects. This object contains a File object for the block > source and destination. The File object stores the path as a string, e.g.: > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825 > This string is repeated for every block and meta file on the DN, and much > of the string is the same each time, leading to a large amount of memory use. > If we change the LinkArgs to store: > * Src path without the block, e.g. > /data01/dfs/dn/previous.tmp/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0 > * Dest path without the block, e.g. > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir10 > * Block / meta file name, e.g. blk_12345678_1001 or blk_12345678_1001.meta > Then, by ensuring we reuse the same File object for repeated src and dest paths, > we can save most of the memory without reworking the logic of the code. 
> The current logic works along the source paths recursively, so you can easily > re-use the src path object. > For the destination path, there are only 32x32 (1024) distinct paths, so we > can simply cache them in a HashMap and look up the reusable object each time. > I tested locally by generating 100k block files and attempting the layout > upgrade. A heap dump showed the 100k blocks using about 140MB of heap. That > is close to 1.5GB per 1M blocks. > After the change outlined above, the same 100k blocks used about 20MB of heap, > so 200MB per million blocks. > A general DN sizing recommendation is 1GB of heap per 1M blocks, so the > upgrade should be able to happen within the pre-upgrade heap. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing
[ https://issues.apache.org/jira/browse/HDFS-15940?focusedWorklogId=579110=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579110 ] ASF GitHub Bot logged work on HDFS-15940: - Author: ASF GitHub Bot Created on: 08/Apr/21 11:39 Start Date: 08/Apr/21 11:39 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2874: URL: https://github.com/apache/hadoop/pull/2874#issuecomment-815693097 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 1s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 37m 25s | | trunk passed | | +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 3s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 34s | | trunk passed | | +1 :green_heart: | javadoc | 1m 3s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 35s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 38s | | trunk passed | | +1 :green_heart: | shadedclient | 19m 28s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 22s | | the patch passed | | +1 :green_heart: | compile | 1m 25s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 1m 25s | | the patch passed | | +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 1m 19s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 58s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 22s | | the patch passed | | +1 :green_heart: | javadoc | 0m 54s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 24s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 37s | | the patch passed | | +1 :green_heart: | shadedclient | 19m 7s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 354m 17s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2874/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 58s | | The patch does not generate ASF License warnings. 
| | | | 453m 13s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestFileTruncate | | | hadoop.hdfs.TestBlocksScheduledCounter | | | hadoop.hdfs.TestSnapshotCommands | | | hadoop.hdfs.server.datanode.TestBlockScanner | | | hadoop.hdfs.server.mover.TestMover | | | hadoop.hdfs.TestDFSShell | | | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor | | | hadoop.hdfs.TestStateAlignmentContextWithHA | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby | | | hadoop.hdfs.server.namenode.TestDecommissioningStatus | | | hadoop.hdfs.TestHDFSFileSystemContract | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList | | | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys | | | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots | | | hadoop.hdfs.server.datanode.TestNNHandlesBlockReportPerStorage | | | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | | | hadoop.hdfs.TestViewDistributedFileSystemContract | | | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.TestPersistBlocks | | Subsystem | Report/Notes
[jira] [Work logged] (HDFS-15937) Reduce memory used during datanode layout upgrade
[ https://issues.apache.org/jira/browse/HDFS-15937?focusedWorklogId=579081=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579081 ] ASF GitHub Bot logged work on HDFS-15937: - Author: ASF GitHub Bot Created on: 08/Apr/21 10:59 Start Date: 08/Apr/21 10:59 Worklog Time Spent: 10m Work Description: sodonnel merged pull request #2838: URL: https://github.com/apache/hadoop/pull/2838 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 579081) Time Spent: 2h (was: 1h 50m) > Reduce memory used during datanode layout upgrade > - > > Key: HDFS-15937 > URL: https://issues.apache.org/jira/browse/HDFS-15937 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0, 3.1.4, 3.2.2, 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > Attachments: heap-dump-after.png, heap-dump-before.png > > Time Spent: 2h > Remaining Estimate: 0h > > When the datanode block layout is upgraded from -56 (256x256) to -57 (32x32), > we have found the datanode uses a lot more memory than usual. > For each volume, the blocks are scanned and a list is created holding a > series of LinkArgs objects. This object contains a File object for the block > source and destination. The File object stores the path as a string, e.g.: > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825 > This string is repeated for every block and meta file on the DN, and much > of the string is the same each time, leading to a large amount of memory use. 
> If we change the LinkArgs to store: > * Src path without the block, e.g. > /data01/dfs/dn/previous.tmp/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0 > * Dest path without the block, e.g. > /data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir10 > * Block / meta file name, e.g. blk_12345678_1001 or blk_12345678_1001.meta > Then, by ensuring we reuse the same File object for repeated src and dest paths, > we can save most of the memory without reworking the logic of the code. > The current logic works along the source paths recursively, so you can easily > re-use the src path object. > For the destination path, there are only 32x32 (1024) distinct paths, so we > can simply cache them in a HashMap and look up the reusable object each time. > I tested locally by generating 100k block files and attempting the layout > upgrade. A heap dump showed the 100k blocks using about 140MB of heap. That > is close to 1.5GB per 1M blocks. > After the change outlined above, the same 100k blocks used about 20MB of heap, > so 200MB per million blocks. > A general DN sizing recommendation is 1GB of heap per 1M blocks, so the > upgrade should be able to happen within the pre-upgrade heap. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15940) Some tests in TestBlockRecovery are consistently failing
[ https://issues.apache.org/jira/browse/HDFS-15940?focusedWorklogId=579046=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579046 ] ASF GitHub Bot logged work on HDFS-15940: - Author: ASF GitHub Bot Created on: 08/Apr/21 09:57 Start Date: 08/Apr/21 09:57 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2874: URL: https://github.com/apache/hadoop/pull/2874#issuecomment-815627370 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 54s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 37m 28s | | trunk passed | | +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 3s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 34s | | trunk passed | | +1 :green_heart: | javadoc | 1m 2s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 33s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 41s | | trunk passed | | +1 :green_heart: | shadedclient | 19m 53s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 23s | | the patch passed | | +1 :green_heart: | compile | 1m 27s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javac | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 17s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | javac | 1m 17s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 54s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 25s | | the patch passed | | +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 | | +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 42s | | the patch passed | | +1 :green_heart: | shadedclient | 18m 59s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 251m 53s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2874/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 19s | | The patch does not generate ASF License warnings. 
| | | | 351m 24s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks | | | hadoop.hdfs.server.namenode.TestFileTruncate | | | hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock | | | hadoop.hdfs.TestGetBlocks | | | hadoop.hdfs.server.diskbalancer.TestDiskBalancerRPC | | | hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap | | | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithOrderedSnapshotDeletion | | | hadoop.hdfs.TestClientReportBadBlock | | | hadoop.hdfs.server.namenode.snapshot.TestAclWithSnapshot | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.blockmanagement.TestErasureCodingCorruption | | | hadoop.hdfs.server.namenode.TestMetadataVersionOutput | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandbyWithQJM | | | hadoop.hdfs.server.blockmanagement.TestSlowDiskTracker | | | hadoop.metrics2.sink.TestRollingFileSystemSinkWithSecureHdfs | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement | | |
[jira] [Updated] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15759: --- Fix Version/s: 3.3.1
> EC: Verify EC reconstruction correctness on DataNode
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, ec, erasure-coding
> Affects Versions: 3.4.0
> Reporter: Toshihiko Uchida
> Assignee: Toshihiko Uchida
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0
>
> Time Spent: 9h 10m
> Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768,
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions, and
> the corruption is neither detected nor auto-healed by HDFS. It is obviously
> hard for users to monitor data integrity by themselves, and even if they find
> corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and
> effective way to verify EC reconstruction correctness on DataNode at each
> reconstruction process.
> It verifies the correctness of the outputs decoded from the inputs as follows:
> 1. Decode one of the inputs from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high
> probability. The task will then also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the
> failure is gone.
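The two-step check in the description can be illustrated with a toy single-parity (XOR) code. This is only a stand-in sketch for the idea: the actual feature verifies Reed-Solomon reconstruction inside the DataNode, and `xor_units` is a made-up helper, not a Hadoop API.

```python
from functools import reduce


def xor_units(units):
    """Byte-wise XOR of equal-length units. With one parity unit, any
    single lost unit equals the XOR of all the others."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*units))


d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
p0 = xor_units([d0, d1, d2])        # encode: one parity unit

# Reconstruction: pretend d1 was lost and rebuild it from the survivors.
d1_rec = xor_units([d0, d2, p0])

# Verification: decode one *input* (d0) back from the freshly
# reconstructed output plus the other survivors, then compare it
# with the original input.
d0_check = xor_units([d1_rec, d2, p0])
print(d0_check == d0)               # True for a correct reconstruction

# A corrupted reconstruction is caught by the same check:
corrupted = bytes(x ^ 0xFF for x in d1_rec)
print(xor_units([corrupted, d2, p0]) == d0)  # False
```

The same structure carries over to RS-6-3: the verification reuses the decoder itself, so any error in the reconstructed outputs propagates into the re-decoded input and fails the comparison with high probability.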
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316997#comment-17316997 ] Max Xie edited comment on HDFS-15175 at 4/8/21, 9:01 AM: -- We encountered this bug on hdfs 3.2.1. Is there any progress now? ping [~hexiaoqiao] [~wanchang] . was (Author: max2049): ping [~hexiaoqiao] [~wanchang] . > Multiple CloseOp shared block instance causes the standby namenode to crash > when rolling editlog > > > Key: HDFS-15175 > URL: https://issues.apache.org/jira/browse/HDFS-15175 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Wan Chang >Priority: Critical > Labels: NameNode > Attachments: HDFS-15175-trunk.1.patch > > > > {panel:title=Crash exception} > 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log > tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp > [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, > atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], > permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, > clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, > txid=32625024993] > java.io.IOException: File is not under construction: .. 
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:360)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>
> {panel:title=Editlog}
> OP_REASSIGN_LEASE (txid 32625021150):
> leaseholder DFSClient_NONMAPREDUCE_-969060727_197760, path .., new holder
> DFSClient_NONMAPREDUCE_1000868229_201260
> ..
> OP_CLOSE (txid 32625023743):
> length 0, inodeId 0, path .., replication 3, mtime 1581816135883,
> atime 1581814760398, blockSize 536870912, clientName/clientMachine empty,
> overwrite false, block [id 5568434562, numBytes 185818644, genStamp
> 4495417845], permissions da_music:hdfs:416
> ..
> OP_TRUNCATE (txid 32625024049):
> src .., clientName DFSClient_NONMAPREDUCE_1000868229_201260, clientMachine ..,
> newLength 185818644, timestamp 1581816136336, truncateBlock [id 5568434562,
> numBytes 185818648, genStamp 4495417845]
> ..
> OP_CLOSE (txid 32625024993):
> length 0, inodeId 0, path .., replication 3, mtime 1581816138774,
> atime 1581814760398, blockSize 536870912, clientName/clientMachine empty,
> overwrite false, block [id 5568434562, numBytes 185818644, genStamp
> 4495417845], permissions da_music:hdfs:416
> {panel}
>
> The block size should be 185818648 in the first CloseOp, but when truncate is
> used it becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence is
> synchronized to the JournalNode in the same batch, and both CloseOps hold the
> same block instance, so the truncate leaves the first CloseOp with the wrong
> block size. When the SNN rolls the editlog, TruncateOp does not put the file
> into the UnderConstruction state. Then, when the second CloseOp is executed,
> the file is not in the UnderConstruction state, and the SNN crashes.
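The root cause described above is a plain object-aliasing bug: two serialized ops share one mutable block instance, so mutating it for the truncate retroactively changes what the earlier CloseOp records. A minimal model (class and field names are made up for the sketch, not the HDFS classes) shows the failure mode and the defensive-copy fix:

```python
import copy


# Minimal model of the aliasing bug: two edit-log ops hold references to
# the *same* mutable block object, so mutating it for the truncate
# retroactively changes what the first CloseOp will serialize.
class Block:
    def __init__(self, block_id, num_bytes, gen_stamp):
        self.block_id = block_id
        self.num_bytes = num_bytes
        self.gen_stamp = gen_stamp


blk = Block(5568434562, 185818648, 4495417845)

close_op = {"op": "OP_CLOSE", "block": blk}   # shares the live instance
blk.num_bytes = 185818644                      # truncate mutates the block
print(close_op["block"].num_bytes)             # the first op now has the wrong size

# Taking a defensive copy when the op is created keeps it stable:
blk2 = Block(5568434562, 185818648, 4495417845)
safe_op = {"op": "OP_CLOSE", "block": copy.copy(blk2)}
blk2.num_bytes = 185818644
print(safe_op["block"].num_bytes)              # the original size is preserved
```

Because the ops are batched to the JournalNode before serialization, the mutation happens in the window between op creation and write-out, which is exactly when aliasing matters.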
[jira] [Commented] (HDFS-15788) Correct the statement for pmem cache to reflect cache persistence support
[ https://issues.apache.org/jira/browse/HDFS-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316912#comment-17316912 ] Hadoop QA commented on HDFS-15788: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 39s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} markdownlint {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} markdownlint was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 16s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 40m 15s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 15s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 18s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 4s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green}{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 60m 42s{color} | {color:black}{color} | {color:black}{color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/565/artifact/out/Dockerfile | | JIRA Issue | HDFS-15788 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13023533/HDFS-15788-02.patch | | Optional Tests | dupname asflicense mvnsite markdownlint | | uname | Linux 956aac858d5d 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / ae88174c29a | | Max. process+thread count | 554 (vs. 
ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/565/console | | versions | git=2.25.1 maven=3.6.3 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Correct the statement for pmem cache to reflect cache persistence support > - > > Key: HDFS-15788 > URL: https://issues.apache.org/jira/browse/HDFS-15788 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Affects Versions: 3.4.0 >Reporter: Feilong He >Assignee: Feilong He >Priority: Minor > Attachments: HDFS-15788-01.patch, HDFS-15788-02.patch > > > Correct the statement for pmem cache to reflect cache persistence support.
[jira] [Commented] (HDFS-15788) Correct the statement for pmem cache to reflect cache persistence support
[ https://issues.apache.org/jira/browse/HDFS-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316901#comment-17316901 ] Feilong He commented on HDFS-15788: --- Hi [~ayushtkn], sorry for the late reply. This issue is related to HDFS-14740, which was resolved in 3.3.0. We proposed this Jira to update the documentation to align with those code changes. The target of this Jira is 3.3.1 & 3.4.0. > Correct the statement for pmem cache to reflect cache persistence support > - > > Key: HDFS-15788 > URL: https://issues.apache.org/jira/browse/HDFS-15788 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Affects Versions: 3.4.0 >Reporter: Feilong He >Assignee: Feilong He >Priority: Minor > Attachments: HDFS-15788-01.patch, HDFS-15788-02.patch > > > Correct the statement for pmem cache to reflect cache persistence support.
[jira] [Updated] (HDFS-15788) Correct the statement for pmem cache to reflect cache persistence support
[ https://issues.apache.org/jira/browse/HDFS-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feilong He updated HDFS-15788: -- Target Version/s: 3.3.1, 3.4.0 (was: 3.3.1, 3.4.0, 3.1.5, 3.2.3) > Correct the statement for pmem cache to reflect cache persistence support > - > > Key: HDFS-15788 > URL: https://issues.apache.org/jira/browse/HDFS-15788 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Affects Versions: 3.4.0 >Reporter: Feilong He >Assignee: Feilong He >Priority: Minor > Attachments: HDFS-15788-01.patch, HDFS-15788-02.patch > > > Correct the statement for pmem cache to reflect cache persistence support.
[jira] [Work logged] (HDFS-15956) Provide utility class for FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-15956?focusedWorklogId=578931=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-578931 ] ASF GitHub Bot logged work on HDFS-15956: - Author: ASF GitHub Bot Created on: 08/Apr/21 06:39 Start Date: 08/Apr/21 06:39 Worklog Time Spent: 10m Work Description: virajjasani edited a comment on pull request #2876: URL: https://github.com/apache/hadoop/pull/2876#issuecomment-815454130 I see, open for opinions. Since I saw ~9k lines of code, I thought of at least refactoring util functions which can be **static** and do not require basic **Namespace tree specific** logic internally (e.g. BlockManager, SnapshotManager, Namesystem read-write locking related logic) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 578931) Time Spent: 1h (was: 50m)
> Provide utility class for FSNamesystem
> --
>
> Key: HDFS-15956
> URL: https://issues.apache.org/jira/browse/HDFS-15956
> Project: Hadoop HDFS
> Issue Type: Task
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> With its ever-growing functionality, FSNamesystem has become very large (~9k
> lines of code) over time. We should provide a utility class and refactor as
> many basic utility functions into the new class as we can.
> With any further suggestions, we can create sub-tasks of this Jira and work
> on them.
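The refactoring criterion named in the comment above, move out helpers that can be static and touch no namespace-tree state, can be sketched generically. The names below are made up for illustration; they are not the real FSNamesystem methods:

```python
# Illustrative sketch of the proposed refactoring: a helper that needs no
# instance state (no locks, no BlockManager/SnapshotManager) can move out
# of the big class into a static utility and be tested in isolation.
# Names are invented for the sketch, not real FSNamesystem code.

class FSNamesystemUtil:
    """Collects pure helpers extracted from the large class."""

    @staticmethod
    def percent_used(used, capacity):
        # Pure computation: safe to extract because it touches no
        # namespace-tree state and needs no read/write lock.
        return 0.0 if capacity == 0 else 100.0 * used / capacity


class FSNamesystem:
    def __init__(self, capacity, used):
        self.capacity = capacity
        self.used = used

    def percent_used(self):
        # The big class now simply delegates to the static helper.
        return FSNamesystemUtil.percent_used(self.used, self.capacity)


print(FSNamesystemUtil.percent_used(25, 100))  # usable standalone
print(FSNamesystem(200, 50).percent_used())    # and via the original API
```

The payoff is that each extracted helper shrinks the main class and becomes unit-testable without constructing a full namesystem.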
[jira] [Updated] (HDFS-15788) Correct the statement for pmem cache to reflect cache persistence support
[ https://issues.apache.org/jira/browse/HDFS-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feilong He updated HDFS-15788: -- Attachment: HDFS-15788-02.patch > Correct the statement for pmem cache to reflect cache persistence support > - > > Key: HDFS-15788 > URL: https://issues.apache.org/jira/browse/HDFS-15788 > Project: Hadoop HDFS > Issue Type: Bug > Components: documentation >Affects Versions: 3.4.0 >Reporter: Feilong He >Assignee: Feilong He >Priority: Minor > Attachments: HDFS-15788-01.patch, HDFS-15788-02.patch > > > Correct the statement for pmem cache to reflect cache persistence support.